3  The Language of Simulation

A probability model of a random phenomenon consists of a sample space of possible outcomes, associated events and random variables, and a probability measure which encapsulates the model assumptions and determines probabilities of events and distributions of random variables. We will study strategies for computing probabilities, distributions, expected values, and more, but in many situations explicit computation is difficult. Simulation provides a powerful tool for working with probability models and solving complex problems.

Simulation involves using a probability model to artificially recreate a random phenomenon, usually using a computer. Given a probability model, we can simulate outcomes, occurrences of events, and values of random variables. Simulation can be used to approximate probabilities of events, distributions of random variables, expected values, and more.

Recall that the probability of an event can be interpreted as a long run relative frequency. Therefore the probability of an event can be approximated by simulating—according to the assumptions of the probability model—the random phenomenon a large number of times and computing the relative frequency of repetitions on which the event occurs.

Likewise, the expected value of a random variable can be interpreted as its long run average value, and can be approximated by simulating—according to the assumptions of the probability model—the random phenomenon a large number of times and computing the average value of the random variable over the simulated repetitions.

In general, a simulation involves the following steps.

  1. Set up. Define a probability space and related random variables and events. The probability measure encodes all the assumptions of the model, but the probability measure is often only specified indirectly.
  2. Simulate. Simulate outcomes, occurrences of events, and values of random variables.
  3. Summarize. Summarize simulation output in plots and summary statistics (e.g., relative frequencies, averages, standard deviations, correlations) to describe and approximate probabilities, distributions, expected values, and more.
  4. Sensitivity analysis. Investigate how results respond to changes in assumptions or parameters of the simulation.
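The four steps above can be sketched in code. The following is a minimal illustration using Python's standard library (the text itself does not prescribe any software), for the scenario "flip a fair coin ten times and count the number of heads" mentioned later in this chapter; the event "at least 7 heads" is a hypothetical example event.

```python
import random

random.seed(42)  # for reproducibility

N_REP = 10000  # number of simulated repetitions

# 1. Set up: "flip a fair coin ten times and count the number of heads".
#    The probability measure is specified indirectly by this description.
def one_repetition():
    flips = [random.choice("HT") for _ in range(10)]
    return flips.count("H")  # random variable: number of heads

# 2. Simulate: repeat the random phenomenon many times.
heads = [one_repetition() for _ in range(N_REP)]

# 3. Summarize: relative frequencies approximate probabilities,
#    and averages approximate expected values.
prob_at_least_7 = sum(h >= 7 for h in heads) / N_REP  # approximates P(at least 7 heads)
avg_heads = sum(heads) / N_REP                        # approximates E(number of heads)

print(prob_at_least_7, avg_heads)

# 4. Sensitivity analysis: rerun with a different number of flips, or an
#    unfair coin, to see how the results change.
```

Note that step 2 produces one value per repetition, and step 3 only summarizes those values; no formula for the probability itself is ever needed.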

You might ask: if we have access to the probability measure, then why do we need simulation to approximate probabilities? Can’t we just compute them? Remember that the probability measure is often only specified indirectly, e.g. “flip a fair coin ten times and count the number of heads”. In most situations the probability measure does not provide an explicit formula for computing the probability of any particular event. And in many cases, it is impossible to enumerate all possible outcomes.

For example, a probabilistic model of a particular Atlantic Hurricane does not provide a mathematical formula for computing the probability that the hurricane makes landfall in the U.S. Nor does the model provide a comprehensive list of the uncountably many possible paths of the hurricane. Rather, the model reflects a set of assumptions under which possible paths can be simulated to approximate probabilities of events of interest like those depicted in Figure 3.1.

In this chapter we will use simulation to investigate some of the problems introduced in the previous chapter. We can solve many of these problems without simulation. But we like to introduce simulation via simple examples where we know the answers so we can get comfortable with the simulation process and easily check that it works.

3.1 Tactile simulation: Boxes and spinners

While we generally use technology to conduct large scale simulations, it is helpful to first consider how to conduct a simulation by hand using physical objects like coins, dice, cards, or spinners.

Many random phenomena can be represented in terms of a “box model”:

  • Imagine a box containing “tickets” with labels. Examples include:
    • Fair coin flip. 2 tickets: 1 labeled H and 1 labeled T
    • Free throw attempt of a 90% free throw shooter. 10 tickets: 9 labeled “make” and 1 labeled “miss”
    • Card shuffling. 52 cards: each card with a pair of labels (face value, suit).
  • The tickets are shuffled in the box and some number are drawn out, either with replacement or without replacement of the tickets before the next draw.
  • In some cases, the order in which the tickets are drawn matters; in other cases the order is irrelevant. For example,
    • Dealing a 5 card poker hand: Select 5 cards without replacement, order does not matter
    • Random digit dialing: Select 4 cards with replacement from a box with tickets labeled 0 through 9 to represent the last 4 digits of a randomly selected phone number with a particular area code and exchange; order matters, e.g., 805-555-1212 is a different outcome than 805-555-2121.
  • Then something is done with the tickets to determine which events of interest have occurred or to measure random variables. For example, you might flip a coin 10 times (by drawing from the H/T box 10 times with replacement) and check if there were at least 3 H in a row or count the number of H.
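In code, a box model is just a collection to draw from. Below is a sketch using Python's standard library (not part of the text) of the box models above: drawing with replacement corresponds to `random.choices`, and drawing without replacement to `random.sample`.

```python
import random

random.seed(1)

# Tickets in a box are represented as list elements.
coin_box = ["H", "T"]                     # fair coin flip
free_throw_box = ["make"] * 9 + ["miss"]  # 90% free throw shooter

# Draw with replacement: random.choices returns each ticket to the box.
flips = random.choices(coin_box, k=10)  # flip a coin 10 times
num_heads = flips.count("H")            # count the number of H

# Check whether there were at least 3 H in a row.
at_least_3_in_a_row = "HHH" in "".join(flips)

# Draw without replacement: random.sample never reuses a ticket,
# e.g. dealing a 5-card poker hand from a 52-card deck, where each
# ticket is a (face value, suit) pair.
deck = [(face, suit) for face in range(1, 14)
        for suit in ["clubs", "diamonds", "hearts", "spades"]]
hand = random.sample(deck, 5)

print(num_heads, at_least_3_in_a_row, hand)
```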

If the draws are made with replacement from a single box, we can think of a single circular “spinner” instead of a box, spun multiple times. For example:

  • Fair coin flip. Spinner with half of the area corresponding to H and half to T
  • Free throw attempt of a 90% free throw shooter. Spinner with 90% of the area corresponding to “make” and 10% “miss”.
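A spinner can be represented in code by a single uniform random number: the needle lands at a uniformly random fraction of the way around, and the label depends on which sector that fraction falls in. A sketch (not from the text), assuming Python's standard library:

```python
import random

random.seed(7)

# One spin is a uniform point on the circle; the sector it lands in
# determines the label.
def spin_coin():
    return "H" if random.random() < 0.5 else "T"  # half the area each

def spin_free_throw():
    # 90% of the area corresponds to "make", 10% to "miss".
    return "make" if random.random() < 0.9 else "miss"

spins = [spin_free_throw() for _ in range(10000)]
prop_make = spins.count("make") / len(spins)  # should be near 0.9
print(prop_make)
```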

Example 3.1 Consider a box model for rolling a four-sided die.

  1. Set up a box model for simulating a roll of each of the following dice.
    1. a fair four-sided die
    2. the weighted die in Example 2.29
    3. the weighted die in Example 2.30.
  2. Roll a die from the previous part twice and let \(X\) be the sum of the two rolls and let \(Y\) be the larger of the two rolls (or the common value if a tie). Explain how you would use the box models from the previous part to simulate a realization of \((X, Y)\).
  3. Let \(A\) be the event that the first roll is a 3, and let \(\textrm{I}_A\) be the indicator random variable. Explain how you would use the box models to simulate a realization of \(A\) and \(\textrm{I}_A\).
  4. In the previous parts, could you use spinners instead?
  5. Use a fair four-sided die (or a box or a spinner) to simulate 10 repetitions. (Yes, really do it.) For each repetition, record the results of the first and second rolls (or draws or spins) and the simulated values of \(X\), \(Y\), \(A\), and \(I_A\).

Solution 3.1.

  1. Each box would have the numbers 1, 2, 3, 4 written on some number of tickets.
    1. 4 tickets: labeled 1, 2, 3, 4
    2. 10 tickets: 1 labeled 1, 2 labeled 2, 3 labeled 3, 4 labeled 4
    3. 15 tickets: 4 labeled 1, 6 labeled 2, 3 labeled 3, 2 labeled 4
  2. Use the appropriate box for the die of interest. Draw two tickets with replacement (that is, put the first ticket back in the box and reshuffle before drawing the second ticket). Let \(X\) be the sum of the two numbers drawn and \(Y\) the larger of the two numbers drawn.
  3. Similar to the previous part, but event \(A\) either happens or not, so a realization of it is true or false instead of a number. If the first roll is 3 then \(A\) is true; otherwise \(A\) is false. The indicator random variable \(\textrm{I}_A\) just translates true to 1 and false to 0.
  4. We could replace the boxes in part 1 with the spinners in Figure 2.9. Spin the appropriate spinner twice and let \(X\) be the sum of the two numbers spun and \(Y\) the larger of the two numbers spun, etc.
  5. See Table 3.1. Results vary naturally so your simulation results will be different, but the same ideas apply.
Table 3.1: Results of 10 repetitions of two rolls of a fair four-sided die. X is the sum of the two rolls, Y is the maximum, and A is the event that the first roll is a 3.
Repetition First roll Second roll X Y Event A occurs? I[A]
1 2 1 3 2 False 0
2 1 1 2 1 False 0
3 3 3 6 3 True 1
4 4 3 7 4 False 0
5 3 2 5 3 True 1
6 3 4 7 4 True 1
7 2 3 5 3 False 0
8 2 4 6 4 False 0
9 1 2 3 2 False 0
10 3 4 7 4 True 1
Warning

The tables of outcomes in this chapter are a different type than those in Chapter 2. Many tables in Chapter 2, such as Table 2.7, represent the sample space: there is one and only one row for each distinct possible outcome, and each outcome (row) has a corresponding theoretical probability. On the other hand, tables of simulation output—like Table 3.1 and many tables in this chapter—contain one row for each repetition of the simulation. A particular outcome might be repeated several times in a simulation and thus might appear in several rows of the table of simulation output. For example, the outcome (3, 4) appears only once in Table 2.7, but twice in Table 3.1 (in repetitions 6 and 10). We will discuss how to approximate the theoretical probability of a particular outcome using the relative frequency of repetitions of that outcome in the simulation.

Note that we are able to simulate outcomes of the rolls and values of \(X\) and \(Y\) without defining the probability space in detail. That is, we do not need to list all the possible outcomes and events and their probabilities. Instead, the probability space is defined implicitly via the specification to “roll a fair four-sided die twice” or “draw two tickets with replacement from a box with four tickets labeled 1, 2, 3, 4” or “spin the spinner on the left of Figure 2.9 twice”. The random variables are defined by what is being measured for each outcome, the sum (\(X\)) and the max (\(Y\)) of the two draws or spins.

A simulation usually involves many repetitions. When conducting a simulation it is important to distinguish between (1) one repetition of the simulation and its output, and (2) the simulation itself and the output from many repetitions. When describing a simulation, refrain from making vague statements like “repeat this” or “do it again”, because “this” or “it” could refer to different elements of the simulation. In the dice example, (1) rolling a die is repeated to generate a single \((X, Y)\) pair, and (2) the process of generating \((X, Y)\) pairs is repeated to obtain the simulation results. That is, a single repetition involves an ordered pair of die rolls, resulting in an outcome \(\omega\), for which the values of the sum \(X(\omega)\) and max \(Y(\omega)\) are computed. This process is repeated many times to generate many outcomes and \((X, Y)\) pairs according to the probability model.

Think of simulation results being organized in a table like Table 3.1, where each row corresponds to a repetition of the simulation, resulting in a possible outcome of the random phenomenon, and each column corresponds to a different random variable or event. Remember that indicators are the bridge between events and random variables. On each repetition of the simulation an event either occurs or not; we could record the occurrence of an event as “true” or “false”, or we could record the value of the corresponding indicator random variable, 1 or 0.

Figure 3.2 displays two plots summarizing the results in Table 3.1. Each dot represents the results of one repetition. Figure 3.2 (a) displays the simulated \((X, Y)\) pairs, and Figure 3.2 (b) displays the simulated values of \(X\) along with their frequencies. While this simulation only consists of 10 repetitions, a larger scale simulation would follow the same process.

(a) Simulated \((X, Y)\) pairs
(b) Simulated values of \(X\)
Figure 3.2: Dot plots of the simulation results in Table 3.1 of 10 repetitions of two rolls of a fair four-sided die, where \(X\) is the sum and \(Y\) is the larger (or common value if a tie) of the two rolls.
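One way the simulation behind Table 3.1 could be carried out in code (a sketch using Python's standard library; the text itself describes a tactile simulation with dice, boxes, or spinners):

```python
import random

random.seed(3)

# One repetition: roll a fair four-sided die twice, i.e. draw twice with
# replacement from a box with tickets labeled 1, 2, 3, 4.
def one_repetition():
    first = random.choice([1, 2, 3, 4])
    second = random.choice([1, 2, 3, 4])
    x = first + second      # X: sum of the two rolls
    y = max(first, second)  # Y: larger of the two rolls
    a = (first == 3)        # A: the event that the first roll is a 3
    i_a = int(a)            # I_A: indicator of A (True -> 1, False -> 0)
    return (first, second, x, y, a, i_a)

# Many repetitions, one row per repetition, as in Table 3.1.
results = [one_repetition() for _ in range(10)]
for rep, row in enumerate(results, start=1):
    print(rep, *row)
```

Each row of `results` corresponds to one row of a table like Table 3.1; a larger simulation just uses more repetitions.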

Example 3.2 Recall Example 2.2. Consider the matching problem with \(n=4\). Label the objects 1, 2, 3, 4, and the spots 1, 2, 3, 4, with spot 1 the correct spot for object 1, etc. Let \(X\) be the number of objects that are placed in the correct spot, and let \(C\) be the event that at least one object is placed in the correct spot. Describe how you would use a box model to simulate a single realization of \(X\) and of \(C\). Could you use a spinner instead?

Solution 3.2. Use a box with 4 tickets, labeled 1, 2, 3, 4. Shuffle the tickets and draw all 4 without replacement and record the tickets drawn in order. Let \(X\) be the number of tickets that match their spot in order. For example, if the tickets are drawn in the order 2431 then the realized value of \(X\) is 1 since only ticket 3 matches its spot in the order. Since \(C=\{X \ge 1\}\), event \(C\) occurs if \(X\ge 1\) and does not occur if \(X=0\). We could record the realization of event \(C\) as “true” or “false”. We could also record the realization of \(\textrm{I}_C\), the indicator random variable for event \(C\), as 1 if \(C\) occurs and 0 if \(C\) does not occur.

Since the tickets are drawn without replacement, we could not simply spin a single spinner, like the one in Figure 2.9 (a), four times; we would sometimes have to discard spins and try again. For example, suppose the first spin results in 2; then if the second spin also results in 2 we would need to discard it and try again. So we would usually need more than four spins to obtain a valid outcome. If we wanted to guarantee a valid outcome in only four spins, we would need a collection of spinners, with the spinner used at each step depending on the results of previous spins. For example, if the first spin results in 2, then we would spin a spinner that lands only on 1, 3, 4; if the second spin then results in 3, we would spin a spinner that lands only on 1 and 4.
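The box-model procedure in Solution 3.2 might be sketched in code as follows (an illustration, not from the text, using Python's standard library); `random.shuffle` plays the role of shuffling the tickets and drawing all four without replacement.

```python
import random

random.seed(0)

# One realization of the matching problem with n = 4: shuffle tickets
# labeled 1-4; the ticket in position i is placed in spot i.
def one_realization():
    tickets = [1, 2, 3, 4]
    random.shuffle(tickets)  # draw all 4 without replacement, in order
    # X: number of tickets whose label matches their spot
    x = sum(ticket == spot
            for spot, ticket in enumerate(tickets, start=1))
    c = (x >= 1)  # C: at least one object in the correct spot
    return x, c

x, c = one_realization()
print(x, c)
```

Note that \(X = 3\) is impossible: if three objects match their spots, the fourth must as well.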

3.1.1 Exercises

Exercise 3.1 Recall the birthday problem of Example 2.48. Let \(B\) be the event that at least two people in a group of \(n\) share a birthday.

Describe in detail how you could use physical objects (coins, cards, spinners, etc.) to simulate by hand a single realization of \(B\) (that is, simulate whether or not \(B\) occurs).

Exercise 3.2 Maya is a basketball player who makes 40% of her three point field goal attempts. Suppose that at the end of every practice session, she attempts three pointers until she makes one and then stops. Let \(X\) be the total number of shots she attempts in a practice session. Assume shot attempts are independent, each with probability 0.4 of being successful. Describe in detail how you could use physical objects (coins, cards, spinners, etc.) to simulate by hand a single value of \(X\).

Exercise 3.3 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Let \(X\) be the number of distinct prizes obtained in these 3 packages. Let \(Y\) be the number of these 3 packages that contain prize 1.

Describe in detail how you could use physical objects (coins, cards, spinners, etc.) to simulate by hand a single \((X, Y)\) pair.

3.2 Tactile simulation: Meeting problem

Now we’ll consider tactile simulation for a continuous sample space. Throughout this section we’ll consider the two-person meeting problem. We’ll continue to measure arrival times in minutes after noon, including fractions of a minute, so that arrival times take values on a continuous scale.

3.2.1 A uniform distribution

Example 3.3 Assume that Regina and Cady each arrive at a time uniformly at random in the continuous time interval between noon and 1, independently of each other.

  1. Explain how you could construct a circular spinner to simulate Regina’s arrival time. In particular, what values go at the “3 o’clock”, “6 o’clock”, and “9 o’clock” points on the spinner’s axis? (Here “3 o’clock” refers to the direction on the “clock” (spinner), and not arrival time.)
  2. Explain how you could use the spinner from the previous part to simulate an outcome, that is, a pair of arrival times for Regina and Cady.
  3. Why could we not simulate this situation with a box model?

Solution 3.3.

  1. Imagine a circular spinner with an infinitely fine needle that after being spun well points to a value on the spinner’s axis uniformly at random. We could label the axis like the spinner in Figure 2.1, but ranging from 0 to 60 instead of 0 to 1; see Figure 3.3, which we’ll call the “Uniform(0, 60)” spinner. (Or we could just use the spinner in Figure 2.1 to get the arrival time as a fraction of the hour, then multiply the result of a spin by 60.) Only selected rounded values are displayed on the circular axis, but in the idealized model the spinner is infinitely precise so that any real number between 0 and 60 is a possible outcome. Assuming the arrival time is uniformly distributed, Regina should have a probability of \(15/60=0.25\) of arriving in each of the 15-minute intervals \([0, 15]\), \([15, 30]\), \([30, 45]\), and \([45, 60]\). Therefore, 15 should go at “3 o’clock”, 30 at “6 o’clock”, and 45 at “9 o’clock”, just like the minutes on a regular clock.
  2. Since the arrival times are independent and follow the same uniform pattern, we can spin the spinner in Figure 3.3 twice to simulate an outcome; the first spin represents Regina’s arrival time, the second Cady’s. The order of the spins matters; for example, the pair (34.89, 12.56) represents Regina arriving at 34.89 minutes after noon and Cady at 12.56, while the pair (12.56, 34.89) represents Regina arriving at 12.56 minutes after noon and Cady at 34.89.
  3. An outcome consists of a pair of values from the continuous interval [0, 60]. Since this interval is uncountable, it’s not possible to write every real number in the interval [0, 60] on a ticket to place in a box. To use a box model, we would have to round arrival times to some desired degree of precision—nearest minute, second, millisecond, etc—and put the rounded values on the tickets. That is probably fine for practical purposes! But if we really want to create a tactile representation of continuous outcomes, a box model won’t work; we tend to use spinners instead.
(a) Values at 1 o’clock, 2 o’clock, etc
(b) Spinner axis marked at 10 minute intervals
Figure 3.3: A continuous Uniform(0, 60) spinner. Only selected rounded values are displayed, but in the idealized model the spinner is infinitely precise so that any real number between 0 and 60 is a possible outcome.

Notice that the values on the circular axis in Figure 3.3 are evenly spaced. For example, the intervals [0, 15], [15, 30], [30, 45], and [45, 60], all of length 15, each represent 25% of the spinner area. If we spin the idealized spinner represented by Figure 3.3 10 times, our results might look like those in Table 3.2.

Table 3.2: Results of 10 spins of the Uniform(0, 60) spinner.
Spin Result
1 1.062914
2 17.729239
3 21.610468
4 27.190663
5 36.913450
6 19.604730
7 14.479123
8 25.198875
9 2.805937
10 17.456726

Notice the number of decimal places. If the sample space is [0, 60], any value in the continuous interval between 0 and 60 is a distinct possible value: 10.25000000000… is different from 10.25000000001… which is different from 10.2500000000000000000001… and so on.
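In code, a spin of the idealized Uniform(0, 60) spinner corresponds to drawing a real number uniformly from [0, 60]. A sketch (not from the text), assuming Python's standard library:

```python
import random

random.seed(5)

# One spin of the Uniform(0, 60) spinner: random.uniform returns a real
# number carrying many decimal places, as in Table 3.2.
spins = [random.uniform(0, 60) for _ in range(10000)]

# Each 15-minute interval represents 25% of the spinner area, so each
# should capture about 25% of the spins.
prop_first_quarter = sum(0 <= s <= 15 for s in spins) / len(spins)
print(prop_first_quarter)
```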

Figure 3.4 displays the 10 values in Table 3.2 plotted along a number line. The values are roughly evenly spaced, but there is some natural variability. (Though it’s hard to discern any pattern with only 10 values.)

Figure 3.4: Plot of the 10 values simulated from a Uniform(0, 60) model, recorded in Table 3.2

To simulate a (Regina, Cady) pair of arrival times, we would spin the Uniform(0, 60) spinner twice. Let \(R\) be the result of the first spin, representing Regina’s arrival time, and \(Y\) the second spin for Cady. Also let \(T=\min(R, Y)\) be the time (minutes after noon) at which the first person arrives, and \(W=|R-Y|\) be the time (minutes) the first person to arrive waits for the second person to arrive. Table 3.3 displays the values of \(R\), \(Y\), \(T\), \(W\) for 10 simulated repetitions, each repetition consisting of a pair of spins of the Uniform(0, 60) spinner.

Table 3.3: Results of 10 repetitions of a simulation of the independent Uniform model in the two-person meeting problem.
Repetition R Y T W
1 49.50512 52.748971 49.505123 3.243848
2 55.38527 22.040907 22.040907 33.344360
3 7.52414 9.956804 7.524140 2.432664
4 51.33105 39.636179 39.636179 11.694869
5 43.77325 26.338476 26.338476 17.434775
6 42.51350 28.619413 28.619413 13.894083
7 31.15456 41.117529 31.154557 9.962971
8 44.01602 4.406319 4.406319 39.609700
9 35.46720 8.867082 8.867082 26.600122
10 56.58527 7.784337 7.784337 48.800934

Figure 3.5 plots the 10 simulated \((R, Y)\) pairs in Table 3.3.

Figure 3.5: Plot of 10 pairs of values, each pair simulated independently from a Uniform(0, 60) model, recorded in Table 3.3. Each dot represents a pair. The components of each pair are marked on the horizontal and vertical axes.

Now suppose we keep repeating the process, resulting in many simulated (Regina, Cady) pairs of arrival times. Figure 3.6 displays 1000 simulated pairs of arrival times, resulting from 1000 pairs of spins of the Uniform(0, 60) spinner.

Figure 3.6: Plot of 1000 pairs of values, each pair simulated independently from a Uniform(0, 60) model. Each dot represents a pair. The components of each pair are marked on the horizontal and vertical axes.

We see that the pairs are fairly evenly distributed throughout the square with sides [0, 60] representing the sample space (though there is some “clumping” due to natural variability). If we simulated more values and summarized them in an appropriate plot, we would expect to see something like Figure 2.12.
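A simulation like the one behind Table 3.3 and Figure 3.6 might be sketched in code as follows (an illustration, not from the text, using Python's standard library). The waiting-time event approximated at the end is a hypothetical example; for independent Uniform(0, 60) arrival times the exact probability that the first arriver waits more than 15 minutes is \((45/60)^2 = 0.5625\).

```python
import random

random.seed(8)

# One repetition: a (Regina, Cady) pair of independent Uniform(0, 60)
# spins, with T the earlier arrival time and W the waiting time of the
# first person to arrive.
def one_repetition():
    r = random.uniform(0, 60)  # R: Regina's arrival time
    y = random.uniform(0, 60)  # Y: Cady's arrival time
    t = min(r, y)              # T: time the first person arrives
    w = abs(r - y)             # W: waiting time of the first arriver
    return r, y, t, w

reps = [one_repetition() for _ in range(10000)]

# Approximate P(W > 15) by the relative frequency of repetitions on
# which the event occurs.
prob_wait_over_15 = sum(w > 15 for (_, _, _, w) in reps) / len(reps)
print(prob_wait_over_15)
```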

3.2.2 A non-uniform distribution

Example 3.4 Imagine we have an unlabeled circular spinner with an infinitely fine needle that after being spun well lands on a point on the spinner’s axis uniformly at random. But now suppose that Regina’s arrival time is not uniformly distributed. How should we label the spinner’s circular axis to reflect assumptions about Regina’s arrival time?

  1. In particular, suppose that Regina has a probability of 0.25 of arriving in each of the intervals \([0, 23]\), \([23, 30]\), \([30, 37]\), \([37, 60]\). Start to label the spinner for simulating Regina’s arrival time; what values should go at the “3 o’clock”, “6 o’clock”, and “9 o’clock” positions on the spinner’s circular axis?

  2. Also assume that Regina has a probability of (roughly):

    • 0.025 of arriving in the interval \([0, 10]\)
    • 0.135 of arriving in the interval \([10, 20]\)
    • 0.135 of arriving in the interval \([40, 50]\)
    • 0.025 of arriving in the interval \([50, 60]\)

    At what points on the spinner’s circular axis should the values 10, 20, 40, and 50 go? What is the probability that Regina arrives in the interval \([20, 30]\)? \([30, 40]\)?

  3. What does this spinner say about Regina’s arrival time: is she most likely to arrive near noon, near 12:30, or near 1:00?

Solution 3.4.

  1. We can split the spinner into four equal sectors representing the four intervals. Therefore, 23 should go at 3 o’clock, 30 at 6 o’clock, and 37 at 9 o’clock. See Figure 3.7 (a). This part only provides enough information to label 3, 6, and 9 o’clock; we’ll discuss where the other values come from below. (Also, we have rounded values in this example.)
  2. Starting from 0 and moving clockwise around the spinner:
    • 10 goes 2.5% of the way around
    • 20 goes 16% of the way around, at roughly 2 o’clock (2.5% to get from 0 to 10 then another 13.5% to get from 10 to 20)
    • 30 goes at 6 o’clock like before
    • 40 goes 84% of the way around, at roughly 10 o’clock (\(1 - 0.025 - 0.135 = 0.84\))
    • 50 goes 97.5% of the way around (\(1 - 0.025 = 0.975\)).
    Since the probability of arriving in \([0, 30]\) is 0.5, the probability of arriving in \([20, 30]\) is \(0.5 - 0.16 = 0.34\). Similarly, the probability of arriving in \([30, 40]\) is 0.34. See Figure 3.7 (b). (We have rounded values a little differently for this example.)
  3. Part 1 shows that Regina is as likely as not to arrive between 12:23 and 12:37. Part 2 shows that Regina has a probability of about 0.68 of arriving within 10 minutes of 12:30, but only a probability of about 0.05 of arriving within 10 minutes of either noon or 1:00. So under these assumptions she is more likely to arrive near 12:30 than near noon or 1:00.
(a) Values at 1 o’clock, 2 o’clock, etc
(b) Spinner axis marked at 10 minute intervals
Figure 3.7: A continuous Normal(30, 10) spinner. The same spinner is displayed on both sides, with different features highlighted on the left and right. Only selected rounded values are displayed, but in the idealized model the spinner is infinitely precise so that any real number is a possible outcome. Notice that the values on the axis are not evenly spaced.

Imagine we have an unlabeled circular spinner with an infinitely fine needle that after being spun well lands on a point on the spinner’s axis uniformly at random. Even though the needle lands uniformly, we can use the spinner to represent non-uniform probability measures by labeling the spinner’s circular axis appropriately. We can “stretch” intervals with higher probability, and “shrink” intervals with lower probability. We have already done this intuitively with discrete spinners; see Figure 1.6 and Figure 2.9. But the same idea applies to continuous spinners.

Figure 3.7 displays a “Normal(30, 10)” spinner. Only selected rounded values are displayed, but the needle can land—uniformly at random—at any point on the continuous circular axis. Pay close attention to the circular axis, though: the values are not equally spaced. For example, the bottom half of the spinner corresponds to the interval [23.26, 36.74], with length 13.48 minutes, while the upper half of the spinner corresponds to the intervals [0, 23.26] and [36.74, 60], with total length 46.52 minutes. Compared with the Uniform(0, 60) spinner, in the Normal(30, 10) spinner intervals near 30 are “stretched out” to reflect a higher probability of arriving near 12:30, while intervals near 0 and 60 are “shrunk” to reflect a lower probability of arriving near 12:00 or 1:00. The interval [20, 40] represents about 68% of the spinner area, so if we spin this spinner many times, about 68% of the spins will land in the interval [20, 40]. A person whose arrival time is represented by this spinner has a probability of about 0.68 of arriving within 10 minutes of 12:30, compared to \(20/60 = 0.33\) for the Uniform(0, 60) model, and a probability of about 0.05 of arriving within 10 minutes of either noon or 1:00, compared to \(20/60=0.33\) in the Uniform(0, 60) model. (The spinner on the left is divided into 12 wedges of equal area, so each wedge represents 8.33% of the probability. Not all values on the axis are labeled, but you can use the wedges to eyeball probabilities.)
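The “stretching” of the axis can be made concrete: the label at a given fraction of the way around the spinner is the corresponding quantile of the distribution. A sketch (not from the text), assuming Python's standard library `statistics.NormalDist`:

```python
import random
from statistics import NormalDist

random.seed(11)

# The needle lands at a uniformly random fraction of the way around; the
# non-uniform axis labels are the Normal(30, 10) quantile function
# (inverse CDF) applied to that fraction.
spinner_axis = NormalDist(mu=30, sigma=10)

def spin_normal():
    u = random.random()             # uniform fraction of the way around
    return spinner_axis.inv_cdf(u)  # stretched/shrunk axis label

# Sanity check against the text: the bottom half of the spinner
# (fractions 0.25 to 0.75 of the way around) covers [23.26, 36.74].
print(round(spinner_axis.inv_cdf(0.25), 2))  # about 23.26
print(round(spinner_axis.inv_cdf(0.75), 2))  # about 36.74
```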

The particular pattern represented by the spinner in Figure 3.7 is a Normal(30, 10) distribution; that is, a Normal distribution with mean 30 and standard deviation 10. We will see Normal distributions in much more detail later. For now, just know that a Normal(30, 10) model reflects one particular pattern of non-uniform probability. Table 3.4 compares probabilities of selected intervals under the Uniform(0, 60) and Normal(30, 10) models.

Warning

The parameters in the uniform probability model are different from those in the Normal model. In the Uniform(0, 60) model, 0 is the minimum possible value; in the Normal(30, 10) model, 30 is the average value. In the Uniform(0, 60) model, 60 is the maximum possible value; in the Normal(30, 10) model, 10 is the standard deviation.

Table 3.4: Comparison of probabilities for the Uniform(0, 60) and Normal(30, 10) models
Interval Uniform(0, 60) probability Normal(30, 10) probability
[0, 10] 0.167 0.025
[10, 20] 0.167 0.136
[20, 30] 0.167 0.341
[30, 40] 0.167 0.341
[40, 50] 0.167 0.136
[50, 60] 0.167 0.025

If we spin the idealized spinner represented by Figure 3.7 10 times, our results might look like those in Table 3.5.

Table 3.5: Results of 10 spins of the Normal(30, 10) spinner.
Spin Result
1 21.38732
2 50.71937
3 24.08964
4 32.68036
5 54.51060
6 38.11696
7 16.81857
8 22.97995
9 28.86703
10 43.43031

Figure 3.8 displays the 10 values in Table 3.5 plotted along a number line. We tend to see more values near 30 than near 0 or 60 (though it’s hard to discern any pattern with only 10 values).

Figure 3.8: Plot of the 10 values simulated from a Normal(30, 10) model, recorded in Table 3.5.

If we spin the spinner in Figure 3.7 many times,

  • About half of the simulated values would be below 30 and half above
  • Because axis values near 30 are stretched out, values near 30 would occur with higher frequency than those near 0 or 60.
  • The shape of the distribution would be symmetric about 30 since the axis spacing of values below 30 mirrors that for values above 30. For example, about 34% of values would be between 20 and 30, and also 34% between 30 and 40.
  • About 68% of values would be between 20 and 40.
  • About 95% of values would be between 10 and 50.

And so on. We could compute percentages for other intervals by measuring the areas of corresponding sectors on the spinner to complete the pattern of variability that values resulting from this spinner would follow. This particular pattern is called a “Normal(30, 10)” distribution, which we will explore in much more detail later.
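The percentages above can be checked with a quick simulation (a sketch, not from the text, using Python's standard library; `random.gauss` plays the role of the Normal(30, 10) spinner, ignoring the tiny fraction of values that fall outside [0, 60] in the idealized model):

```python
import random

random.seed(13)

# Many spins of the Normal(30, 10) spinner.
spins = [random.gauss(30, 10) for _ in range(100000)]
n = len(spins)

prop_below_30 = sum(s < 30 for s in spins) / n       # about 0.50
prop_20_to_40 = sum(20 < s < 40 for s in spins) / n  # about 0.68
prop_10_to_50 = sum(10 < s < 50 for s in spins) / n  # about 0.95
print(prop_below_30, prop_20_to_40, prop_10_to_50)
```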

Now suppose Regina’s and Cady’s arrival times are each reasonably modeled by a Normal(30, 10) model, independently of each other. To simulate a (Regina, Cady) pair of arrival times, we would spin the Normal(30, 10) spinner twice. Table 3.6 displays the results of 10 repetitions, each repetition resulting in a (Regina, Cady) pair.

Table 3.6: Results of 10 pairs of spins of the Normal(30, 10) spinner.
Repetition Regina's time Cady's time
1 31.758630 50.89160
2 7.842356 34.23954
3 29.644008 40.34209
4 25.154312 26.89845
5 38.011497 32.04975
6 27.325808 42.97185
7 44.594399 38.65448
8 18.557759 30.99171
9 30.142711 36.53947
10 26.479249 30.23044

Figure 3.9 plots the 10 simulated pairs in Table 3.6.

Figure 3.9: Plot of 10 pairs of values, each pair simulated from a Normal(30, 10) model, recorded in Table 3.6. Each dot represents a pair. The components of each pair are marked on the horizontal and vertical axes.

Figure 3.10 displays 1000 pairs of (Regina, Cady) arrival times, resulting from 1000 pairs of spins of the Normal(30, 10) spinner. Compared with the simulated pairs from the Uniform(0, 60) spinner (in Figure 3.6), we see many more simulated pairs in the center of the plot (when both arrive near 12:30) than in the corners of the plot (where either arrives near 12:00 or 1:00). If we simulated more values and summarized them in an appropriate plot, we would expect to see something like Figure 2.14.

Figure 3.10: Plot of 1000 pairs of values, each pair simulated from a Normal(30, 10) model. Each dot represents a pair. The components of each pair are marked on the horizontal and vertical axes.

Example 3.3 and Example 3.4 assumed that Regina’s and Cady’s arrival times individually followed the same model, both Uniform(0, 60) or both Normal(30, 10), so we just spun the same spinner twice to simulate a pair of arrival times. However, we could easily simulate from different models. Suppose Regina’s arrival time follows a Uniform(0, 60) model while Cady’s follows a Normal(30, 10) model, independently of each other. Then we could simulate a pair of arrival times by spinning the Uniform(0, 60) spinner for Regina and the Normal(30, 10) spinner for Cady.
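Mixing models is just as easy in code: spin a different "spinner" for each person. A sketch (not from the text), assuming Python's standard library:

```python
import random

random.seed(17)

# Regina: Uniform(0, 60) spinner; Cady: Normal(30, 10) spinner,
# independently of each other.
def one_pair():
    regina = random.uniform(0, 60)
    cady = random.gauss(30, 10)
    return regina, cady

pairs = [one_pair() for _ in range(10000)]

# Regina should land in [20, 40] about 33% of the time; Cady about 68%.
prop_regina = sum(20 < r < 40 for r, _ in pairs) / len(pairs)
prop_cady = sum(20 < c < 40 for _, c in pairs) / len(pairs)
print(prop_regina, prop_cady)
```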

So far we have assumed that Regina and Cady arrive independently, but what if they coordinate and their arrival times are related? Recall that Figure 2.15 reflects a model where Regina and Cady each are more likely to arrive around 12:30 than noon or 1:00, and also more likely to arrive around the same time. In such a situation, we could still use spinners to simulate pairs of arrival times, but it’s more involved than just spinning a single spinner twice (or having just one spinner for each person). We’ll revisit using spinners to simulate dependent pairs later.

3.2.3 Exercises

Exercise 3.4 Katniss throws a dart at a circular dartboard with radius 12 inches. Suppose that the dart lands uniformly at random anywhere on the dartboard, and assume that Katniss’s dart never misses the dartboard. Suppose that the dartboard is on a coordinate plane, with the center of the dartboard at (0, 0) and the north, south, east, and west edges, respectively, at coordinates (0, 12), (0,-12), (12, 0), (-12, 0). When the dart hits the board its \((X, Y)\) coordinates are recorded. Let \(R\) be the distance (inches) from the location of the dart to the center of the dartboard.

  1. Sketch a Uniform(0, 12) spinner.
  2. Describe how you could use a fair coin and the Uniform(0, 12) spinner to simulate the \((X, Y)\) coordinates of a single throw of the dart and the value of \(R\). Hint: you might need to flip the coin and spin the spinner multiple times. What will you do if your simulated coordinates correspond to “off the board”?
  3. Suppose you want to simulate \(R\) directly. Could you just spin the Uniform(0, 12) spinner and record the value? Explain. Hint: see Exercise 2.13.

Exercise 3.5 Continuing Exercise 3.4. Using computations like those in Exercise 2.13, we can show the following.

Range  Probability that \(R\) is in range
0 to 1 0.0069
1 to 2 0.0208
2 to 3 0.0347
3 to 4 0.0486
4 to 5 0.0625
5 to 6 0.0764
6 to 7 0.0903
7 to 8 0.1042
8 to 9 0.1181
9 to 10 0.1319
10 to 11 0.1458
11 to 12 0.1597

Sketch a spinner that could be used to simulate values of \(R\). You should label 12 sectors on the spinner. Hint: the sectors won’t be the same size and the values on the outside axis of the spinner won’t be evenly spaced.

3.3 Computer simulation: Dice rolling

We will perform computer simulations using the Python package Symbulate. The syntax of Symbulate mirrors the language of probability in that the primary objects in Symbulate are the same as the primary components of a probability model: probability spaces, random variables, events. Once these components are specified, Symbulate allows users to simulate many times from the probability model and summarize the results.

This section contains a brief introduction to Symbulate; more examples can be found throughout the text. Symbulate can be installed with pip.

pip install git+https://github.com/kevindavisross/symbulate

Import Symbulate during a Python session using the following command.

from symbulate import *

We’ll start with a dice rolling example. Unless indicated otherwise, in this section \(X\) represents the sum of two rolls of a fair four-sided die, and \(Y\) represents the larger of the two rolls (or the common value if a tie). We have already discussed a tactile simulation; now we’ll carry out the process on a computer.

There aren’t many examples for you to work in this section. Instead, we encourage you to open a Python session and copy and run the code as you read. In particular, you can run Python code using this template Colab notebook which includes the code needed to get started with Symbulate.

3.3.1 Simulating outcomes

The following Symbulate code defines a probability space P for simulating the 16 equally likely ordered pairs of rolls via a box model.

P = BoxModel([1, 2, 3, 4], size = 2, replace = True)

The above code tells Symbulate to draw 2 tickets (size = 2), with replacement, from a box with tickets labeled 1, 2, 3, and 4 (entered as the Python list [1, 2, 3, 4]). Each simulated outcome consists of an ordered pair of rolls.

The sim(r) command simulates r realizations of probability space outcomes (or events or random variables). Here is the result of one repetition.

P.sim(1)
Index  Result
0      (1, 4)

And here are the results of 10 repetitions. (We will typically run thousands of repetitions, or more, but in this section we just run a few repetitions for illustration.)

P.sim(10)
Index  Result
0      (2, 3)
1      (3, 3)
2      (3, 1)
3      (3, 1)
4      (4, 1)
5      (1, 2)
6      (4, 3)
7      (2, 2)
8      (4, 4)
...
9      (2, 1)
Warning

Every time .sim() is called, a new simulation is run. When running a simulation, the “simulate” step should be performed with a single call to sim so that all analysis of results corresponds to the same simulated values. We’ll show how to do this below.
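The same principle applies outside of Symbulate. Here is a minimal plain-Python sketch (standard library only, not Symbulate) of the “simulate once, summarize many times” workflow: all summaries are computed from the same stored repetitions.

```python
import random

random.seed(2)

# Simulate once and store the results; every summary below then
# refers to the SAME 1000 simulated pairs of rolls.
rolls = [(random.randint(1, 4), random.randint(1, 4)) for _ in range(1000)]

sums = [a + b for a, b in rolls]
maxes = [max(a, b) for a, b in rolls]

# Re-running the simulation between summaries would instead compare
# statistics from two different sets of repetitions.
print(sum(sums) / len(sums), sum(maxes) / len(maxes))
```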

3.3.2 Simulating random variables

A Symbulate RV is specified by the probability space on which it is defined and the mapping function which defines it. Recall that \(X\) is the sum of the two dice rolls and \(Y\) is the larger (max).

X = RV(P, sum)
Y = RV(P, max)

The above code simply defines the random variables. Again, we can simulate values with .sim(). Since every call to sim runs a new simulation, we typically store the simulation results as an object. The following commands simulate 100 values of the random variable X and store the results as x. For consistency with standard probability notation, the random variable itself is denoted with an uppercase letter X, while its realized values are denoted with a lowercase letter x.

x = X.sim(100)

x # this just displays x; only the first few and last values will print
Index  Result
0      5
1      7
2      6
3      7
4      6
5      5
6      5
7      6
8      5
...
99     5

3.3.3 Simulating multiple random variables

If we call X.sim(10000) and Y.sim(10000) we get two separate simulations of 10000 pairs of rolls, one which returns the sum of the rolls for each repetition, and the other the max. If we want to study relationships between \(X\) and \(Y\) we need to compute both \(X\) and \(Y\) for each pair of rolls in the same simulation.
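To see why separate simulations won’t do, here is a plain-Python sketch (standard library only, not Symbulate). Pairing up values from two independent runs can produce logically impossible combinations, such as a sum larger than twice the max.

```python
import random

random.seed(3)

def roll_pair():
    return (random.randint(1, 4), random.randint(1, 4))

# WRONG: two separate simulations; row i of xs and row i of ys
# come from different pairs of rolls.
xs = [sum(roll_pair()) for _ in range(10000)]
ys = [max(roll_pair()) for _ in range(10000)]
impossible = sum(1 for x, y in zip(xs, ys) if x > 2 * y)
print(impossible)  # rows like sum 8 paired with max 3

# RIGHT: compute both X and Y from the same pair in each repetition.
pairs = [roll_pair() for _ in range(10000)]
xy = [(sum(p), max(p)) for p in pairs]
assert all(x <= 2 * y for x, y in xy)  # the sum never exceeds twice the max
```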

We can simulate \((X, Y)\) pairs using &. We store the simulation output as x_and_y to emphasize that x_and_y contains pairs of values.

x_and_y = (X & Y).sim(10)

x_and_y # this just displays x_and_y
Index  Result
0      (5, 3)
1      (3, 2)
2      (4, 3)
3      (5, 3)
4      (7, 4)
5      (8, 4)
6      (6, 4)
7      (4, 2)
8      (5, 4)
...
9      (7, 4)

Think of x_and_y as a table with two columns. We can select columns using brackets []. Remember that Python uses zero-based indexing, so 0 corresponds to the first column, 1 to the second, etc.

x = x_and_y[0]

x
Index  Result
0      5
1      3
2      4
3      5
4      7
5      8
6      6
7      4
8      5
...
9      7
y = x_and_y[1]

y
Index  Result
0      3
1      2
2      3
3      3
4      4
5      4
6      4
7      2
8      4
...
9      4

We can also select multiple columns.

x_and_y[0, 1]
Index  Result
0      (5, 3)
1      (3, 2)
2      (4, 3)
3      (5, 3)
4      (7, 4)
5      (8, 4)
6      (6, 4)
7      (4, 2)
8      (5, 4)
...
9      (7, 4)

3.3.4 Simulating outcomes and random variables

When calling X.sim(10) or (X & Y).sim(10) the outcomes of the rolls are generated in the background but not saved. We can create an RV which returns the outcomes of the probability space. The default mapping function for RV is the identity function, \(g(u) = u\), so simulating values of U = RV(P) below returns the outcomes of the BoxModel P representing the outcome of the two rolls.

U = RV(P)

U.sim(10)
Index  Result
0      (4, 1)
1      (4, 2)
2      (4, 2)
3      (2, 1)
4      (4, 1)
5      (1, 2)
6      (1, 2)
7      (4, 3)
8      (1, 3)
...
9      (1, 2)

Now we can simulate and display the outcomes along with the values of \(X\) and \(Y\) using &.

(U & X & Y).sim(10)
Index  Result
0      ((2, 2), 4, 2)
1      ((1, 3), 4, 3)
2      ((1, 1), 2, 1)
3      ((1, 2), 3, 2)
4      ((1, 1), 2, 1)
5      ((1, 3), 4, 3)
6      ((1, 2), 3, 2)
7      ((2, 3), 5, 3)
8      ((3, 4), 7, 4)
...
9      ((1, 1), 2, 1)

Because the probability space P returns pairs of values, U = RV(P) above defines a random vector. The individual components of U can be “unpacked” as U1, U2 in the following. Here U1 is an RV representing the result of the first roll and U2 the second.

U1, U2 = RV(P)
(U1 & U2 & X & Y).sim(10)
Index  Result
0      (4, 2, 6, 4)
1      (4, 4, 8, 4)
2      (4, 2, 6, 4)
3      (3, 4, 7, 4)
4      (2, 4, 6, 4)
5      (3, 3, 6, 3)
6      (4, 2, 6, 4)
7      (3, 2, 5, 3)
8      (3, 1, 4, 3)
...
9      (2, 2, 4, 2)

3.3.5 Simulating events

Events involving random variables can also be defined and simulated. For programming reasons, events are enclosed in parentheses () rather than braces \(\{\}\). For example, we can define the event that the larger of the two rolls is less than 3, \(A=\{Y<3\}\), as

A = (Y < 3) # an event

We can use sim to simulate events. A realization of an event is True if the event occurs for the simulated outcome, or False if not.

A.sim(10)
Index  Result
0      False
1      False
2      True
3      False
4      True
5      False
6      True
7      False
8      False
...
9      False

For logical equality use a double equal sign ==. For example, (Y == 3) represents the event \(\{Y=3\}\).

(Y == 3).sim(10)
Index  Result
0      False
1      False
2      True
3      False
4      True
5      True
6      False
7      True
8      False
...
9      True
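Outside of Symbulate, the same idea can be sketched with plain Python booleans (standard library only): an event realization is True or False, and since Python treats True as 1 and False as 0, relative frequencies are just means of booleans.

```python
import random

random.seed(4)

# Simulate 10000 pairs of rolls; the event {Y < 3} occurs when the
# larger of the two rolls is less than 3.
pairs = [(random.randint(1, 4), random.randint(1, 4)) for _ in range(10000)]
a = [max(pair) < 3 for pair in pairs]  # one True/False per repetition

# True counts as 1 and False as 0, so the relative frequency of the
# event is the mean of the booleans. Exact: P(Y < 3) = 4/16 = 0.25.
rel_freq = sum(a) / len(a)
print(round(rel_freq, 3))
```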

In most situations, events of interest involve random variables. It is much more common to simulate random variables directly, rather than events; see Section 3.12.1 for further discussion. Events are primarily used in Symbulate for conditioning.

3.3.6 Simulating transformations of random variables

Transformations of random variables (defined on the same probability space) are random variables. If X is a Symbulate RV and g is a function, then g(X) is also a Symbulate RV.

For example, we can simulate values of \(X^2\). (In Python, exponentiation is represented by **; e.g., 2 ** 5 = 32.)

(X ** 2).sim(10)
Index  Result
0      49
1      25
2      16
3      9
4      25
5      49
6      25
7      64
8      16
...
9      4

For many common functions (e.g., sqrt, log, cos), the syntax g(X) is sufficient. Here sqrt(u) is the square root function \(g(u) = \sqrt{u}\).

(X & sqrt(X)).sim(10)
Index  Result
0      (2, 1.4142135623730951)
1      (7, 2.6457513110645907)
2      (3, 1.7320508075688772)
3      (4, 2.0)
4      (5, 2.23606797749979)
5      (4, 2.0)
6      (6, 2.449489742783178)
7      (3, 1.7320508075688772)
8      (5, 2.23606797749979)
...
9      (6, 2.449489742783178)

For general functions, including user-defined functions, the syntax for defining \(g(X)\) is X.apply(g). Here we define the function \(g(u) = (u-5)^2\) and then define the random variable \(g(X) = (X - 5)^2\).

# define a function g; u represents a generic input
def g(u):
  return (u - 5) ** 2

# define the RV g(X)  
Z = X.apply(g)

# simulate X and Z pairs
(X & Z).sim(10)
Index  Result
0      (4, 1)
1      (8, 9)
2      (6, 1)
3      (7, 4)
4      (6, 1)
5      (7, 4)
6      (5, 0)
7      (6, 1)
8      (6, 1)
...
9      (6, 1)

We can also apply transformations of multiple RVs defined on the same probability space. (We will look more closely at how Symbulate treats this “same probability space” issue later.)

For example, we can simulate values of \(XY\), the product of \(X\) and \(Y\).

(X * Y).sim(10)
Index  Result
0      2
1      8
2      20
3      20
4      28
5      20
6      28
7      18
8      12
...
9      6

Recall that we defined \(X\) via X = RV(P, sum). Using the random variables \(U_1, U_2\) representing the individual rolls, we can instead define \(X=U_1 + U_2\). Recall that we previously defined U1, U2 = RV(P).

X = U1 + U2

X.sim(10)
Index  Result
0      4
1      6
2      6
3      4
4      6
5      5
6      7
7      5
8      7
...
9      8

Unfortunately max(U1, U2) does not work, but we can use the apply syntax. Since we want to apply max to \((U_1, U_2)\) pairs, we must first join them together with &.

Y  = (U1 & U2).apply(max)

Y.sim(10)
Index  Result
0      3
1      2
2      3
3      4
4      4
5      4
6      4
7      3
8      4
...
9      2

3.3.7 Other probability spaces

So far we have assumed a fair four-sided die. Now consider the weighted die in Example 2.29: a single roll results in 1 with probability 0.1, 2 with probability 0.2, 3 with probability 0.3, and 4 with probability 0.4. BoxModel assumes equally likely tickets by default, but we can specify non-equally likely tickets using the probs option. The probability space WeightedRoll in the following code corresponds to a single roll of the weighted die; the default size option is 1.

WeightedRoll = BoxModel([1, 2, 3, 4], probs = [0.1, 0.2, 0.3, 0.4])
WeightedRoll.sim(10)
Index  Result
0      1
1      2
2      2
3      3
4      2
5      4
6      4
7      2
8      4
...
9      2
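The probs option has a direct standard-library analog: random.choices accepts a weights argument. A minimal plain-Python sketch of the weighted die (not using Symbulate):

```python
import random

random.seed(5)

# random.choices with weights plays the role of BoxModel's probs
# option: each draw lands on 1, 2, 3, 4 with the stated probabilities.
faces = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

rolls = random.choices(faces, weights=probs, k=10000)

# relative frequency of each face; should be close to probs
for face in faces:
    print(face, rolls.count(face) / len(rolls))
```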

We could add size = 2 to the BoxModel to create a probability space corresponding to two rolls of the weighted die. Alternatively, we can think of BoxModel([1, 2, 3, 4], probs = [0.1, 0.2, 0.3, 0.4]) as defining the middle spinner in Figure 2.9 that we want to spin two times, which we can do with ** 2.

Q = BoxModel([1, 2, 3, 4], probs = [0.1, 0.2, 0.3, 0.4]) ** 2
Q.sim(10)
Index  Result
0      (4, 3)
1      (1, 4)
2      (4, 4)
3      (3, 3)
4      (3, 4)
5      (3, 4)
6      (1, 2)
7      (1, 3)
8      (4, 3)
...
9      (3, 4)

So BoxModel([1, 2, 3, 4]) ** 2 is equivalent to BoxModel([1, 2, 3, 4]) * BoxModel([1, 2, 3, 4]). In light of our discussion in Section 2.7, the product * notation should seem natural for independent spins.

We could use the product notation to define a probability space corresponding to a pair of rolls, one from a fair die and one from a weighted die.

MixedRolls = BoxModel([1, 2, 3, 4]) * BoxModel([1, 2, 3, 4], probs = [0.1, 0.2, 0.3, 0.4])
MixedRolls.sim(10)
Index  Result
0      (1, 3)
1      (3, 4)
2      (3, 4)
3      (3, 3)
4      (1, 3)
5      (1, 2)
6      (3, 4)
7      (3, 4)
8      (1, 4)
...
9      (3, 1)

Now consider the weighted die from Example 2.30, represented by the spinner in Figure 2.9 (c). We could use the probs option, but we can also imagine a box model with 15 tickets—four tickets labeled 1, six tickets labeled 2, three tickets labeled 3, and two tickets labeled 4—from which a single ticket is drawn. A BoxModel can be specified in this way using the following {label: number of tickets with the label} formulation. This formulation is especially useful when multiple tickets are drawn from the box without replacement.

Q = BoxModel({1: 4, 2: 6, 3: 3, 4: 2})
Q.sim(10)
Index  Result
0      2
1      4
2      2
3      2
4      4
5      2
6      2
7      4
8      2
...
9      2
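The {label: count} formulation amounts to writing out the box explicitly, which a plain-Python sketch (standard library only) makes concrete; with the explicit box in hand, random.sample handles drawing without replacement.

```python
import random

random.seed(6)

# Expand the {label: number of tickets} description into an explicit
# box of 15 tickets.
counts = {1: 4, 2: 6, 3: 3, 4: 2}
box = [label for label, n in counts.items() for _ in range(n)]

one_draw = random.choice(box)  # a single draw from the box

# The explicit box also makes drawing WITHOUT replacement natural:
# random.sample draws distinct tickets from the box.
three_without_replacement = random.sample(box, 3)
print(one_draw, three_without_replacement)
```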

While many scenarios can be represented by box models, there are also many Symbulate probability spaces other than BoxModel. When tickets are equally likely and sampled with replacement, a DiscreteUniform model can also be used. Think of a DiscreteUniform(a, b) probability space as corresponding to a spinner with sectors of equal area labeled with integer values from a to b (inclusive). For example, the spinner in Figure 2.9 (a) corresponds to the DiscreteUniform(1, 4) model. This gives us another way to represent the probability space corresponding to two rolls of a fair four-sided die.

P = DiscreteUniform(1, 4) ** 2
P.sim(10)
Index  Result
0      (2, 3)
1      (2, 1)
2      (4, 2)
3      (2, 4)
4      (4, 1)
5      (3, 2)
6      (1, 3)
7      (1, 2)
8      (1, 4)
...
9      (1, 4)

Note that BoxModel is the only probability space with the size argument. For other probability spaces, the product * or exponentiation ** notation must be used to simulate multiple spins.

This section has only introduced how to set up a probability model and simulate realizations. We’ll see how to summarize and use simulation output soon.

3.3.8 Exercises

Exercise 3.6 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Let \(X\) be the number of distinct prizes obtained in these 3 packages. Let \(Y\) be the number of these 3 packages that contain prize 1.

Write Symbulate code to define an appropriate probability space and random variables, and simulate a few repetitions. Hint: you’ll need to define a function to define \(X\); try len(unique(...)).

Exercise 3.7 Continuing Exercise 3.6. Now suppose that 10% of boxes contain prize 1, 30% contain prize 2, and 60% contain prize 3. Write Symbulate code to define an appropriate probability space and random variables, and simulate a few repetitions.

Exercise 3.8 Recall the birthday problem of Example 2.48. Let \(B\) be the event that at least two people in a group of \(n\) share a birthday.

Write Symbulate code to define an appropriate probability space, and simulate a few realizations of \(B\) (that is, simulate whether or not \(B\) occurs).

Exercise 3.9 Maya is a basketball player who makes 40% of her three point field goal attempts. Suppose that at the end of every practice session, she attempts three pointers until she makes one and then stops. Let \(X\) be the total number of shots she attempts in a practice session. Assume shot attempts are independent, each with probability of 0.4 of being successful.

Write Symbulate code to define an appropriate probability space and random variable, and simulate a few repetitions.

3.4 Computer simulation: Meeting problem

Now we’ll introduce computer simulation of the continuous models in Section 3.2. Throughout this section we’ll consider the two-person meeting problem. Let \(R\) be the random variable representing Regina’s arrival time (minutes after noon, including fractions of a minute), and let \(Y\) represent Cady’s. Also let \(T=\min(R, Y)\) be the time (minutes after noon) at which the first person arrives, and \(W=|R-Y|\) be the time (minutes) the first person to arrive waits for the second person to arrive.
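As a preview, here is a plain standard-library sketch (not Symbulate) of simulating these quantities, assuming, as in Example 3.3, independent Uniform(0, 60) arrival times.

```python
import random

random.seed(7)

# One repetition: simulate the (R, Y) pair of arrival times, then
# compute T (first arrival) and W (waiting time) from the pair.
def meeting():
    r = random.uniform(0, 60)  # Regina's arrival time
    y = random.uniform(0, 60)  # Cady's arrival time
    t = min(r, y)              # time at which the first person arrives
    w = abs(r - y)             # time the first person waits
    return r, y, t, w

reps = [meeting() for _ in range(10000)]

# e.g., approximate the probability they arrive within 15 minutes of
# each other; the exact value under this model is 7/16 = 0.4375
p_within_15 = sum(1 for _, _, _, w in reps if w <= 15) / len(reps)
print(p_within_15)
```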

3.4.1 Independent Uniform model

First consider the situation of Example 3.3 where Regina and Cady each arrive at a time uniformly at random in [0, 60], independently of each other. The following code defines a Uniform(0, 60) spinner, like in Figure 3.3, which we spin twice to get the (Regina, Cady) pair of outcomes.

P = Uniform(0, 60) ** 2
P.sim(10)
Index  Result
0      (58.10944665550521, 0.9452480289172271)
1      (17.575250047925934, 10.664339811835909)
2      (15.306413182621396, 55.1381382448321)
3      (20.55283460981003, 11.843989187382887)
4      (26.22467122663007, 0.6990662184319607)
5      (35.52730703467107, 58.0864445350071)
6      (50.76929953912868, 28.60426333112853)
7      (51.82740034364747, 52.05358398895726)
8      (39.8572591670875, 39.10118853387785)
...
9      (5.015519045703732, 14.135718037002425)

Notice (again) the number of decimal places; any value in the continuous interval between 0 and 60 is a distinct possible value.

A probability space outcome is a (Regina, Cady) pair of arrival times. We can define the random variables \(R\) and \(Y\), representing the individual arrival times, by “unpacking” the outcomes.

R, Y = RV(P)

(R & Y).sim(10)
Index  Result
0      (30.577437999584625, 10.103698991206581)
1      (4.596063184421029, 8.097587994172526)
2      (29.761252180474248, 32.826371945385034)
3      (39.1902476056498, 7.9687641693768185)
4      (17.35377245807986, 33.94324017398339)
5      (37.662494642762546, 14.519653754733255)
6      (21.662577486835517, 29.62131632000045)
7      (59.95280666271828, 48.45420476365996)
8      (26.577913183796372, 46.71508475652507)
...
9      (14.953340580837525, 43.87374290745507)

We can define \(W = |R-Y|\) using the abs function. In order to define \(T = \min(R, Y)\) we need to use the apply syntax with R & Y.

W = abs(R - Y)

T = (R & Y).apply(min)

Now we can simulate values of \(R\), \(Y\), \(T\), and \(W\) with a single call to sim. Each row in the resulting Table 3.7 corresponds to a single simulated outcome (pair of arrival times).

(R & Y & T & W).sim(10)
Table 3.7
Index Result
0(30.741137656898957, 48.70770133650778, 30.741137656898957, 17.966563679608825)
1(37.98479801341665, 58.316779339999414, 37.98479801341665, 20.331981326582763)
2(37.04390917061587, 39.98476919392394, 37.04390917061587, 2.940860023308069)
3(40.854357716917974, 9.44801419458264, 9.44801419458264, 31.406343522335334)
4(21.464597213901826, 45.22860940217012, 21.464597213901826, 23.764012188268296)
5(25.136321239693782, 5.453715741158087, 5.453715741158087, 19.682605498535693)
6(21.317480090354167, 46.90438308481847, 21.317480090354167, 25.586902994464303)
7(3.3609673791360994, 45.269273976382095, 3.3609673791360994, 41.908306597245996)
8(13.70133612400396, 51.2636606144166, 13.70133612400396, 37.56232449041264)
......
9(38.56147415305798, 41.40699403710109, 38.56147415305798, 2.8455198840431137)

We can simulate values of \(R\) and plot them along a number line in a “rug” plot; compare to Figure 3.4. Notice how we have chained together the sim and plot commands. (We’ll see some more interesting and useful plots later.)

import matplotlib.pyplot as plt

plt.figure()
R.sim(100).plot('rug')
plt.show()

Calling .plot() (without 'rug') for simulated values of a continuous random variable produces a histogram, which we will discuss in much more detail later.

plt.figure()
R.sim(10000).plot()
plt.show()

We can simulate and plot many \((R, Y)\) pairs of arrival times; compare to Figure 3.6 (without the rug).

(R & Y).sim(1000).plot()

When plotting simulated pairs, the first component is plotted on the horizontal axis and the second component on the vertical axis. In the following plots, we’ll import matplotlib.pyplot as plt and use plt.xlabel("x-axis text") and plt.ylabel("y-axis text") to label axes.

We can also simulate and plot many \((T, W)\) pairs. For purposes of illustration, first we simulate all four random variables, then we’ll select the columns corresponding to \((T, W)\) and plot the simulated pairs.

import matplotlib.pyplot as plt

meeting_sim = (R & Y & T & W).sim(1000)

meeting_sim[2, 3].plot()
plt.xlabel("T")
plt.ylabel("W")

3.4.2 Independent Normal model

Now consider a Normal(30, 10) model, represented by the spinner in Figure 3.7.

Normal(mean = 30, sd = 10).sim(10)
Index  Result
0      35.475839821165046
1      34.95281551586672
2      10.191451144471579
3      37.999202502092025
4      38.33744492628745
5      23.614080861064924
6      32.89218341733457
7      33.11974045981144
8      23.5800998826558
...
9      32.38985921424401

We define a probability space corresponding to (Regina, Cady) pairs of arrival times, by assuming that their arrival times individually follow a Normal(30, 10) model, independently of each other. That is, we spin the Normal(30, 10) spinner twice to simulate a pair of arrival times.

P = Normal(30, 10) ** 2
P.sim(10)
Index  Result
0      (26.630278850676955, 24.13602708849833)
1      (1.2961734918732688, 37.99778526966676)
2      (20.04424327167561, 39.09904184561107)
3      (26.39597655221813, 34.742818525780486)
4      (22.35148239452506, 42.446131249255394)
5      (27.34006977499485, 28.412354009949247)
6      (29.802260476784884, 40.35081344981252)
7      (35.63917930982561, 32.12651171404675)
8      (25.96437997142845, 31.483716290203223)
...
9      (25.128930226000758, 24.11976205302846)

We can unpack the individual \(R\), \(Y\) random variables and define \(W\), \(T\) as before.

R, Y = RV(P)

W = abs(R - Y)

T = (R & Y).apply(min)

We can simulate values of \(R\) and plot them along a number line in a rug plot; compare to Figure 3.8. (Technically, the Normal(30, 10) assigns small but positive probability to values outside of [0, 60], so we might see some values outside of [0, 60].)

plt.figure()
R.sim(100).plot('rug')
plt.show()

We can simulate many values and summarize them in a histogram. We’ll discuss histograms in more detail later, but notice that the histogram conveys that for a Normal(30, 10) distribution values near 30 are more likely than values near 0 or 60.

plt.figure()
R.sim(10000).plot()
plt.show()

We can simulate and plot many \((R, Y)\) pairs of arrival times; compare to Figure 3.10 (without the rug).

(R & Y).sim(1000).plot()
plt.xlabel("R");
plt.ylabel("Y");

We can also simulate and plot many \((T, W)\) pairs. Notice that the plot looks quite different from the one for the independent Uniform(0, 60) model for arrival times.

meeting_sim = (R & Y & T & W).sim(1000)

meeting_sim[2, 3].plot()
plt.xlabel("T");
plt.ylabel("W");

3.4.3 Bivariate Normal model

Now assume that Regina and Cady tend to arrive around the same time. We haven’t yet seen how to construct spinners to reflect dependence, but we’ll briefly introduce a particular model and some code. One way to model pairs of values that have a relationship or correlation is with a BivariateNormal model, like in the following.

P = BivariateNormal(mean1 = 30, sd1 = 10, mean2 = 30, sd2 = 10, corr = 0.7)
P.sim(10)
Index  Result
0      (26.637927718911346, 33.91042359676721)
1      (42.73405172178866, 38.2906232491268)
2      (16.693104149399048, 20.079810098143593)
3      (35.96686142822037, 38.89633916520376)
4      (20.91623082924098, 23.92103041133864)
5      (34.18977836941798, 29.315834618789467)
6      (18.345392663864423, 27.459832728671298)
7      (36.17176171655053, 31.928884894347394)
8      (33.01235256236067, 39.87509254967353)
...
9      (44.62136753966423, 33.160651899100856)

Note that a BivariateNormal probability space returns pairs directly. We can unpack the pairs as before, and plot some simulated values.

R, Y = RV(P)

(R & Y).sim(1000).plot()
plt.xlabel("R");
plt.ylabel("Y");

Now we see that Regina and Cady tend to arrive near the same time, similar to Figure 2.15.
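Although we haven’t yet seen spinners for dependent pairs, a standard construction builds a correlated Normal pair from two independent standard Normal spins. The following plain-Python sketch (standard library only, not Symbulate) uses that construction and checks the sample correlation.

```python
import math
import random

random.seed(8)

# Construct a correlated pair from independent standard Normals Z1, Z2:
#   R = mean + sd * Z1
#   Y = mean + sd * (corr * Z1 + sqrt(1 - corr^2) * Z2)
def bivariate_normal_pair(mean=30, sd=10, corr=0.7):
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    r = mean + sd * z1
    y = mean + sd * (corr * z1 + math.sqrt(1 - corr ** 2) * z2)
    return r, y

pairs = [bivariate_normal_pair() for _ in range(10000)]

# sample correlation of the simulated pairs; should be close to 0.7
n = len(pairs)
mean_r = sum(r for r, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
cov = sum((r - mean_r) * (y - mean_y) for r, y in pairs) / n
sd_r = math.sqrt(sum((r - mean_r) ** 2 for r, _ in pairs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs) / n)
corr_hat = cov / (sd_r * sd_y)
print(round(corr_hat, 2))
```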

If we plot \((T, W)\) pairs, we expect values of the waiting time \(W\) to tend to be closer to 0 than in the independent arrivals models. Compare the scale on the vertical axis, corresponding to \(W\), in each of the three plots of \((T, W)\) pairs in this section.

W = abs(R - Y)

T = (R & Y).apply(min)

meeting_sim = (R & Y & T & W).sim(1000)

meeting_sim[2, 3].plot()
plt.xlabel("T");
plt.ylabel("W");

A call to sim always simulates independent repetitions from the model, but be careful about what this means. In the Bivariate Normal model, the \(R\) and \(Y\) values are dependent within a repetition. However, the \((R, Y)\) pairs are independent between repetitions. Thinking in terms of a table, the \(R, Y\) columns are dependent, but the rows are independent.

We will study specific probability models like Normal and BivariateNormal in much more detail as we go.

3.4.4 Exercises

Exercise 3.10 Write Symbulate code to conduct the simulation in Exercise 3.4. Caveat: don’t worry about the code to discard repetitions where the dart lands off the board. (We’ll return to this later.)

Exercise 3.11 Consider a continuous version of the dice rolling problem where instead of rolling two fair four-sided dice (which return values 1, 2, 3, 4) we spin twice a Uniform(1, 4) spinner (which returns any value in the continuous range between 1 and 4). Let \(X\) be the sum of the two spins and let \(Y\) be the larger of the two spins. Write Symbulate code to define an appropriate probability space and random variables, and simulate a few repetitions.

3.5 Approximating probabilities: Relative frequencies

We can use simulation-based relative frequencies to approximate probabilities. That is, the probability of event \(A\) can be approximated by simulating—according to the assumptions encoded in the probability measure \(\textrm{P}\)—the random phenomenon a large number of times and computing the relative frequency of \(A\).

\[ {\small \textrm{P}(A) \approx \frac{\text{number of repetitions on which $A$ occurs}}{\text{number of repetitions}}, \quad \text{for a large number of repetitions simulated according to $\textrm{P}$} } \]

In practice, many repetitions of a simulation are performed on a computer to approximate what happens in the “long run”. However, we often start by carrying out a few repetitions by hand to help make the process more concrete.
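For example, here is a plain-Python sketch (standard library only) approximating \(\textrm{P}(X=6)\), where \(X\) is the sum of two rolls of a fair four-sided die.

```python
import random

random.seed(9)

# Approximate P(X = 6), where X is the sum of two rolls of a fair
# four-sided die. The exact probability is 3/16 = 0.1875.
n = 10000
count = sum(
    1
    for _ in range(n)
    if random.randint(1, 4) + random.randint(1, 4) == 6
)
p_hat = count / n
print(p_hat)
```

With 10,000 repetitions the relative frequency is typically within about 0.01 of the exact value.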

Example 3.5 Recall your simulation results from Example 3.1; ours are in Table 3.1. Based only on your 10 repetitions, how would you approximate the following? (Don’t worry if the approximations are any good yet.)

  1. \(\textrm{P}(A)\), where \(A\) is the event that the first roll is 3.
  2. \(\textrm{P}(X=6)\)
  3. \(\textrm{P}(X \ge 6)\)
  4. \(\textrm{P}(Y = 3)\)
  5. \(\textrm{P}(Y \ge 3)\)
  6. \(\textrm{P}(X=6, Y=3)\)
  7. \(\textrm{P}(X\ge6, Y \ge 3)\)

Solution 3.5. See Table 3.1 for the results of our simulation.

  1. Approximate \(\textrm{P}(A)\) by 4/10 = 0.4, the relative frequency of event \(A\) in the simulation; that is, the proportion of repetitions where the first roll is 3.
  2. Approximate \(\textrm{P}(X=6)\) by 2/10 = 0.2, the proportion of repetitions where the sum is 6.
  3. Approximate \(\textrm{P}(X\ge 6)\) by 5/10 = 0.5, the proportion of repetitions where the sum is at least 6.
  4. Approximate \(\textrm{P}(Y=3)\) by 3/10 = 0.3, the proportion of repetitions where the max is 3.
  5. Approximate \(\textrm{P}(Y\ge 3)\) by 7/10 = 0.7, the proportion of repetitions where the max is at least 3.
  6. Approximate \(\textrm{P}(X=6, Y = 3)\) by 1/10 = 0.1, the proportion of repetitions where both the sum is 6 and the max is 3.
  7. Approximate \(\textrm{P}(X\ge 6, Y \ge 3)\) by 5/10 = 0.5, the proportion of repetitions where both the sum is at least 6 and the max is at least 3. (Since \(X\ge 6\) implies \(Y\ge 3\), \(\textrm{P}(X\ge 6, Y\ge 3) = \textrm{P}(X\ge 6)\).)

Example 3.6 Continuing Example 3.5.

  1. Construct a table of the simulated relative frequencies of each possible value \(x\) of \(X\).
  2. Construct a table of the simulated relative frequencies of each possible value \((x, y)\) pair of \((X, Y)\).

Solution 3.6. For discrete random variables like these we can make tables or plots summarizing the observed values of the random variables and their corresponding relative frequencies.

Summarizing our simulation results from Table 3.1, the observed values of \(X\) and corresponding relative frequencies are

\(x\) Relative frequency
2 1/10
3 2/10
4 0
5 2/10
6 2/10
7 3/10
8 0

The above table represents an approximation of the distribution of \(X\), albeit a bad approximation; compare with Table 2.14.

We can summarize simulated \((X, Y)\) pairs and their relative frequencies, as in the following two-way table; compare with Table 2.17.

\(x\) \ \(y\) 1 2 3 4
2 1/10 0 0 0
3 0 2/10 0 0
4 0 0 0 0
5 0 0 2/10 0
6 0 0 1/10 1/10
7 0 0 0 3/10
8 0 0 0 0

You might have noticed that many of the simulated relative frequencies in Example 3.5 provide terrible estimates of the corresponding probabilities. For example, the true probability that the first roll is a 3 is \(\textrm{P}(A) = 0.25\) while the simulated relative frequency is 0.4. The problem is that the simulation only consisted of 10 repetitions. Probabilities can be approximated by long run relative frequencies, but 10 repetitions certainly doesn’t qualify as the long run! The more repetitions we perform the better our estimates should be. But how many repetitions is sufficient? And how accurate are the estimates? We will address these issues in Section 3.6.

3.5.1 A few Symbulate commands for summarizing simulation output

We’ll continue with the dice rolling example. Recall the setup.

P = DiscreteUniform(1, 4) ** 2

X = RV(P, sum)

Y = RV(P, max)

First we’ll simulate and store 10 values of \(X\).

x = X.sim(10)
x # displays the simulated values
Index  Result
0      5
1      5
2      2
3      5
4      4
5      5
6      7
7      6
8      4
...
9      6

Suppose we want to find the relative frequency of 6. We can count the number of simulated values equal to 6 with count_eq().

x.count_eq(6)
2

The count is the frequency. To find the relative frequency we simply divide by the number of simulated values.

x.count_eq(6) / 10
0.2

We can find frequencies of other events using the “count” functions:

  • count_eq(u): count equal to (\(=\)) u
  • count_neq(u): count not equal to (\(\neq\)) u
  • count_leq(u): count less than or equal to (\(\le\)) u
  • count_lt(u): count less than (\(<\)) u
  • count_geq(u): count greater than or equal to (\(\ge\)) u
  • count_gt(u): count greater than (\(>\)) u
  • count: count according to a custom True/False criterion (see examples below)

Using count() with no inputs defaults to “count all”, which provides a way to count the total number of simulated values. (This is especially useful when conditioning.)

x.count_eq(6) / x.count()
0.2

The tabulate method provides a quick summary of the individual simulated values and their frequencies.

x.tabulate()
Value  Frequency
2      1
4      2
5      4
6      2
7      1
Total  10

By default, tabulate returns frequencies (counts). Adding the argument normalize = True returns relative frequencies (proportions).

x.tabulate(normalize = True)
Value  Relative Frequency
2      0.1
4      0.2
5      0.4
6      0.2
7      0.1
Total  1.0
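A minimal plain-Python analogue of tabulate can be built from collections.Counter. This is a sketch under the same dice model, not Symbulate’s own code.

```python
import random
from collections import Counter

random.seed(1)  # for reproducibility of this sketch

# Simulate 10 values of X, the sum of two rolls of a fair four-sided die
x = [random.randint(1, 4) + random.randint(1, 4) for _ in range(10)]

# Analogue of x.tabulate(): value -> frequency
freqs = Counter(x)

# Analogue of x.tabulate(normalize = True): value -> relative frequency
rel_freqs = {v: c / len(x) for v, c in sorted(freqs.items())}
print(rel_freqs)
```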

We often initially simulate a small number of repetitions to see what the simulation is doing and check that it is working properly. However, in order to accurately approximate probabilities or distributions we simulate a large number of repetitions (usually thousands for our purposes). Now let’s simulate many \(X\) values and summarize the results.

x = X.sim(10000)

x.tabulate(normalize = True)
Table 3.8: Table representing simulation-based approximate distribution of \(X\), the sum of two rolls of a fair four-sided die.
Value  Relative Frequency
2      0.0531
3      0.1249
4      0.1963
5      0.2506
6      0.1873
7      0.1295
8      0.0583
Total  1.0

Compare to Table 2.14; with 10000 repetitions the simulation-based approximations are pretty close to the theoretical probabilities.
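The comparison can also be made directly in plain Python. Since there are 16 equally likely pairs of rolls, the pmf of the sum is \(\textrm{P}(X=x) = (4-\lvert x-5\rvert)/16\) for \(x = 2, \ldots, 8\); the sketch below (standard library only, not Symbulate) checks the simulated relative frequencies against these values.

```python
import random
from collections import Counter

random.seed(2024)  # for reproducibility of this sketch
n = 10_000

# Simulate 10,000 values of X, the sum of two rolls of a fair four-sided die
x = [random.randint(1, 4) + random.randint(1, 4) for _ in range(n)]
counts = Counter(x)
rel_freqs = {v: counts[v] / n for v in range(2, 9)}

# Theoretical pmf of the sum: (4 - |x - 5|) / 16 for x = 2, ..., 8
theory = {v: (4 - abs(v - 5)) / 16 for v in range(2, 9)}

# With 10,000 repetitions each relative frequency should be within
# a couple hundredths of the corresponding theoretical probability
for v in range(2, 9):
    print(v, rel_freqs[v], theory[v])
```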

Graphical summaries play an important role in analyzing simulation output. We have previously seen rug plots of individual simulated values. Rug plots emphasize that realizations of a random variable are numbers along a number line. However, a rug plot does not adequately summarize relative frequencies. Instead, calling .plot() for simulated values of a discrete random variable produces an impulse plot which displays the simulated values and their relative frequencies; see Figure 3.11 and compare to Figure 2.16. Since we stored the simulated values as x, the same simulated values are used to produce Table 3.8 and Figure 3.11.

x.plot()
Figure 3.11: Impulse plot representing simulation-based approximate distribution of \(X\), the sum of two rolls of a fair four-sided die.

Now we simulate and summarize a few \((X, Y)\) pairs.

x_and_y = (X & Y).sim(10)
x_and_y 
Index  Result
0      (8, 4)
1      (5, 3)
2      (6, 3)
3      (6, 3)
4      (6, 4)
5      (5, 4)
6      (3, 2)
7      (5, 4)
8      (5, 4)
...    ...
9      (7, 4)

Pairs of values can also be tabulated.

x_and_y.tabulate()
Value   Frequency
(3, 2)  1
(5, 3)  1
(5, 4)  3
(6, 3)  2
(6, 4)  1
(7, 4)  1
(8, 4)  1
Total   10
x_and_y.tabulate(normalize = True)
Value   Relative Frequency
(3, 2)  0.1
(5, 3)  0.1
(5, 4)  0.3
(6, 3)  0.2
(6, 4)  0.1
(7, 4)  0.1
(8, 4)  0.1
Total   1.0

Individual pairs can be plotted in a scatter plot, which is a two-dimensional analog of a rug plot.

x_and_y.plot()

The values can be “jittered” slightly, as below, so that points do not coincide.

x_and_y.plot(jitter = True)

The two-dimensional analog of an impulse plot is a tile plot. For two discrete variables, the 'tile' plot type produces a tile plot (a.k.a. heat map) where rectangles represent the simulated pairs with their relative frequencies visualized on a color scale.

x_and_y.plot('tile')

Custom functions can be used with count to compute relative frequencies of events involving multiple random variables. Suppose we want to approximate \(\textrm{P}(X<6, Y \ge 2)\). We first define a Python function which takes as an input a pair u = (u[0], u[1]) and returns True if u[0] < 6 and u[1] >= 2.


def is_x_lt_6_and_y_ge_2(u):
  return u[0] < 6 and u[1] >= 2

Now we can use this function along with count to find the simulated relative frequency of the event \(\{X <6, Y \ge 2\}\). Remember that x_and_y stores \((X, Y)\) pairs of values, so the first coordinate x_and_y[0] represents values of \(X\) and the second coordinate x_and_y[1] represents values of \(Y\).

x_and_y.count(is_x_lt_6_and_y_ge_2) / x_and_y.count()
0.5

We could also count using Boolean logic, essentially using indicators and the property \(\textrm{I}_{\{X<6,Y\ge 2\}}=\textrm{I}_{\{X<6\}}\textrm{I}_{\{Y\ge 2\}}\).

((x_and_y[0] < 6) * (x_and_y[1] >= 2)).count_eq(True) / x_and_y.count()
0.5
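Both counting approaches have direct plain-Python analogues; the sketch below (standard library only, not Symbulate) shows that the predicate and the product-of-indicators computations agree.

```python
import random

random.seed(3)  # for reproducibility of this sketch
n = 10

# Simulate 10 pairs of rolls and form the corresponding (X, Y) pairs
pairs = [(random.randint(1, 4), random.randint(1, 4)) for _ in range(n)]
x_and_y = [(a + b, max(a, b)) for a, b in pairs]  # (X, Y) = (sum, max)

# Relative frequency of {X < 6, Y >= 2} via a True/False predicate...
p1 = sum(1 for u in x_and_y if u[0] < 6 and u[1] >= 2) / n

# ...or via the product of indicators I{X<6} * I{Y>=2}
p2 = sum((u[0] < 6) * (u[1] >= 2) for u in x_and_y) / n
print(p1, p2)
```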

Now we simulate many \((X, Y)\) pairs and summarize their frequencies.

x_and_y = (X & Y).sim(10000)

x_and_y.tabulate()
Value   Frequency
(2, 1)  615
(3, 2)  1283
(4, 2)  593
(4, 3)  1265
(5, 3)  1302
(5, 4)  1223
(6, 3)  611
(6, 4)  1267
(7, 4)  1241
(8, 4)  600
Total   10000

Here are the relative frequencies; compare with Table 2.16.

x_and_y.tabulate(normalize = True)
Value   Relative Frequency
(2, 1)  0.0615
(3, 2)  0.1283
(4, 2)  0.0593
(4, 3)  0.1265
(5, 3)  0.1302
(5, 4)  0.1223
(6, 3)  0.0611
(6, 4)  0.1267
(7, 4)  0.1241
(8, 4)  0.0600
Total   1.0

When there are thousands of simulated pairs, a scatter plot does not adequately display relative frequencies, even with jittering.

x_and_y.plot(jitter = True)

The tile plot provides a better summary. Notice how the colors represent the relative frequencies in the previous table.

x_and_y.plot('tile')

Finally, we find the simulated relative frequency of the event \(\{X <6, Y \ge 2\}\).

x_and_y.count(is_x_lt_6_and_y_ge_2) / x_and_y.count()
0.5666

3.5.2 Approximating probabilities in the meeting problem

Example 3.7 Recall the independent Uniform(0, 60) model for arrival times of Example 3.3. Use the simulation results in Table 3.3 to approximate the following. (Don’t worry about whether the approximations are any good yet.)

  1. \(\textrm{P}(T < 15)\)
  2. \(\textrm{P}(W < 15)\)
  3. \(\textrm{P}(T < 15, W < 15)\)
  4. \(\textrm{P}(R = 20)\)
  5. \(\textrm{P}(19.5 < R < 20.5)\)

Solution 3.7.

  1. Approximate \(\textrm{P}(T < 15)\) by 3/10 = 0.3, the proportion of repetitions where the first arrival time is less than 15 minutes.
  2. Approximate \(\textrm{P}(W < 15)\) by 6/10 = 0.6, the proportion of repetitions where the waiting time is less than 15 minutes.
  3. Approximate \(\textrm{P}(T < 15, W < 15)\) by 2/10 = 0.2, the proportion of repetitions where both the first arrival time and the waiting time are less than 15 minutes.
  4. Approximate \(\textrm{P}(R=20)\) by 0, the proportion of repetitions where \(R\) is equal to 20.
  5. Approximate \(\textrm{P}(19.5 < R < 20.5)\) by 1/10 = 0.1, the proportion of repetitions where \(R\), rounded to the nearest minute, is equal to 20.

An event either happens or not. Regardless of whether an event involves a discrete or continuous random variable, the probability of the event is approximated in the same way. However, the kinds of events we’re interested in differ between those involving discrete and those involving continuous random variables. The probability that a continuous random variable is equal to any particular value is 0, so we’re not interested in approximating “equals to” probabilities for continuous random variables. When dealing with continuous random variables in practice, “equals to” is really “close to”, and we compute approximate “close to” probabilities with relative frequencies as usual. (Later, we’ll see what it means to condition on a continuous random variable being equal to some value.)

To get better approximations for the probabilities in Example 3.3 we would need to simulate many more repetitions. But the process of approximating probabilities with simulated relative frequencies is the same as in Example 3.3.
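A plain-Python sketch of the independent Uniform(0, 60) model with many more repetitions looks like the following (standard library only, not Symbulate). By a geometric argument, both probabilities below equal \(1-(45/60)^2 = 0.4375\), so the relative frequencies should land near that value.

```python
import random

random.seed(4)  # for reproducibility of this sketch
n = 10_000

count_T, count_W = 0, 0
for _ in range(n):
    r = random.uniform(0, 60)  # Regina's arrival time
    y = random.uniform(0, 60)  # Cady's arrival time
    t = min(r, y)              # first arrival time T
    w = abs(r - y)             # waiting time W
    count_T += (t < 15)
    count_W += (w < 15)

# Approximate P(T < 15) and P(W < 15)
print(count_T / n, count_W / n)
```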

Simulation can also be used to investigate how sensitive (approximate) probabilities are to changes in assumptions.

Example 3.8 Write Symbulate code to run simulations and use the results to approximate the probability that Regina and Cady arrive within 15 minutes of each other for each of the three models in Section 3.4.

  1. Independent Uniform model
  2. Independent Normal model
  3. Bivariate Normal model

Solution 3.8. There are a few ways to code this, but one way is to define the waiting time random variable \(W=|R-Y|\) and find simulated relative frequencies of the event \(\{W<15\}\).

First the independent Uniform model.

P = Uniform(0, 60) ** 2
R, Y = RV(P)

W = abs(R - Y)

w = W.sim(10000)

w.count_lt(15) / w.count()
0.4381

Next the independent Normal model.

P = Normal(30, 10) ** 2
R, Y = RV(P)

W = abs(R - Y)

w = W.sim(10000)

w.count_lt(15) / w.count()
0.7106

Finally, the Bivariate Normal model.

P = BivariateNormal(mean1 = 30, sd1 = 10, mean2 = 30, sd2 = 10, corr = 0.7)
R, Y = RV(P)

W = abs(R - Y)

w = W.sim(10000)

w.count_lt(15) / w.count()
0.944

We see that these changes in assumptions have a substantial influence on this probability. The simulations show that Regina and Cady are roughly 1.6 times more likely to arrive within 15 minutes of each other under the independent Normal model than under the independent Uniform model, and roughly 1.3 times more likely under the Bivariate Normal model than under the independent Normal model.
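The same sensitivity analysis can be sketched in plain Python (standard library only, not Symbulate). The only non-obvious step is sampling a Bivariate Normal pair: it can be built from two independent standard normals via \(Y = 30 + 10(\rho Z_1 + \sqrt{1-\rho^2}\, Z_2)\).

```python
import math
import random

random.seed(5)  # for reproducibility of this sketch
n = 10_000

def frac_within_15(draw_pair):
    """Relative frequency of {|R - Y| < 15} under a given pair sampler."""
    return sum(abs(r - y) < 15 for r, y in (draw_pair() for _ in range(n))) / n

# Independent Uniform(0, 60) model
p_unif = frac_within_15(lambda: (random.uniform(0, 60), random.uniform(0, 60)))

# Independent Normal(30, 10) model
p_norm = frac_within_15(lambda: (random.gauss(30, 10), random.gauss(30, 10)))

# Bivariate Normal model with correlation 0.7, built from two
# independent standard normals Z1, Z2
rho = 0.7
def bvn():
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return 30 + 10 * z1, 30 + 10 * (rho * z1 + math.sqrt(1 - rho**2) * z2)
p_bvn = frac_within_15(bvn)

print(p_unif, p_norm, p_bvn)
```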

We have previously seen rug plots of individual simulated values. Rug plots emphasize that realizations of a random variable are numbers along a number line. However, a rug plot does not adequately summarize the distribution of values. Instead, calling .plot() for simulated values of a continuous random variable produces a histogram.

plt.figure()
w.plot()
plt.show()

We will cover histograms and marginal distributions of continuous random variables in much more detail in ?sec-simulation-marginal-continuous.

3.5.3 Exercises

Exercise 3.12 The latest series of collectible Lego Minifigures contains 3 different Minifigure prizes (labeled 1, 2, 3). Each package contains a single unknown prize. Suppose we only buy 3 packages and we consider as our sample space outcome the results of just these 3 packages (prize in package 1, prize in package 2, prize in package 3). For example, 323 (or (3, 2, 3)) represents prize 3 in the first package, prize 2 in the second package, prize 3 in the third package. Let \(X\) be the number of distinct prizes obtained in these 3 packages. Let \(Y\) be the number of these 3 packages that contain prize 1.

  1. Explain how you could, in principle, conduct a simulation by hand and use the results to approximate
    1. \(\textrm{P}(X = 2)\)
    2. \(\textrm{P}(Y = 1)\)
    3. \(\textrm{P}(X = 2, Y = 1)\)
  2. Write Symbulate code to conduct a simulation and approximate the values in part 1.
  3. Write Symbulate code to conduct a simulation and approximate
    1. Marginal distribution of \(X\)
    2. Marginal distribution of \(Y\)
    3. Joint distribution of \(X\) and \(Y\)

Exercise 3.13 Repeat Exercise 3.12, but now assuming that 10% of packages contain prize 1, 30% contain prize 2, and 60% contain prize 3.

Exercise 3.14 Maya is a basketball player who makes 40% of her three point field goal attempts. Suppose that at the end of every practice session, she attempts three pointers until she makes one and then stops. Let \(X\) be the total number of shots she attempts in a practice session. Assume shot attempts are independent, each with probability of 0.4 of being successful.

  1. Explain in words how you could use simulation to approximate the distribution of \(X\) and \(\textrm{P}(X > 3)\).
  2. Write Symbulate code to approximate the distribution of \(X\) and \(\textrm{P}(X > 3)\).

Exercise 3.15 Consider a continuous version of the dice rolling problem where instead of rolling two fair four-sided dice (which return values 1, 2, 3, 4) we spin a Uniform(1, 4) spinner twice (which returns any value in the continuous range between 1 and 4). Let \(X\) be the sum of the two spins and let \(Y\) be the larger of the two spins.

  1. Describe in words how you could use simulation to approximate
    1. \(\textrm{P}(X < 3.5)\)
    2. \(\textrm{P}(Y > 2.7)\)
    3. \(\textrm{P}(X < 3.5, Y > 2.7)\)
  2. Write Symbulate code to conduct a simulation to approximate, via plots
    1. the marginal distribution of \(X\)
    2. the marginal distribution of \(Y\)
    3. the joint distribution of \(X\) and \(Y\)
    4. The probabilities from part 1

3.6 Approximating probabilities: Simulation margin of error

The probability of an event can be approximated by simulating the random phenomenon a large number of times and computing the relative frequency of the event. After enough repetitions we expect the simulated relative frequency to be close to the true probability, but there probably won’t be an exact match. Therefore, in addition to reporting the approximate probability, we should also provide a margin of error which indicates how close we think our simulated relative frequency is to the true probability.

Section 1.2.1 introduced the relative frequency interpretation in the context of flipping a fair coin. After many flips of a fair coin, we expect the proportion of flips resulting in H to be close to 0.5. But how many flips is enough? And how “close” to 0.5? We’ll investigate these questions now.

Consider Figure 3.12 below. Each dot represents a set of 10,000 fair coin flips. There are 100 dots displayed, representing 100 different sets of 10,000 coin flips each. For each set of flips, the proportion of the 10,000 flips which landed on heads is recorded. For example, if in one set 4973 out of 10,000 flips landed on heads, the proportion of heads is 0.4973. The plot displays 100 such proportions; similar values have been “binned” together for plotting. We see that 98 of these 100 proportions are between 0.49 and 0.51, represented by the blue dots. So if “between 0.49 and 0.51” is considered “close to 0.5”, then yes, in 10,000 coin flips we would expect the proportion of heads to be close to 0.5.

Figure 3.12: Proportion of flips which are heads in 100 sets of 10,000 fair coin flips. Each dot represents a set of 10,000 fair coin flips. In 98 of these 100 sets the proportion of heads is between 0.49 and 0.51 (the blue dots).

Our discussion of Figure 3.12 suggests that 0.01 might be an appropriate margin of error for a simulation based on 10,000 flips. Suppose we perform a simulation of 10,000 flips with 4973 landing on heads. We could say “we estimate that the probability that a coin lands on heads is equal to 0.4973”. But such a precise estimate would almost certainly be incorrect, due to natural variability in the simulation. In fact, only 1 of the 100 sets resulted in a proportion of heads exactly equal to the true probability of 0.5.

A better statement would be “we estimate that the probability that a coin lands on heads is 0.4973 with a margin of error of 0.01”. This means that we estimate that the true probability of heads is within 0.01 of 0.4973. In other words, we estimate that the true probability of heads is between 0.4873 and 0.5073, an interval whose endpoints are \(0.4973 \pm 0.01\). This interval estimate is “accurate” in the sense that the true probability of heads, 0.5, is between 0.4873 and 0.5073. By providing a margin of error, we have sacrificed a little precision—“equal to 0.4973” versus “within 0.01 of 0.4973”—to achieve greater accuracy.

Let’s explore this idea of “accuracy” further. Recall that Figure 3.12 displays the proportion of flips which landed on heads for 100 sets of 10,000 flips each. Suppose that for each of these sets we form an interval estimate of the probability that the coin lands on heads by adding/subtracting 0.01 from the simulated proportion, as we did for \(0.4973 \pm 0.01\) in the previous paragraph. Figure 3.13 displays the results. Even though the proportion of heads was equal to 0.5 in only 1 set, in 98 of these 100 sets (the blue intervals) the corresponding interval contains 0.5, the true probability of heads. For almost all of the sets, the interval formed via “relative frequency \(\pm\) margin of error” provides an accurate estimate of the true probability. However, not all the intervals contain the true probability, which is why we often qualify that our margin of error is for “95% confidence” or “95% accuracy”. We will see more about “confidence” soon. In any case, the discussion so far, and the results in Figure 3.12 and Figure 3.13, suggest that 0.01 is a reasonable choice for margin of error when estimating the probability that a coin lands on heads based on 10,000 flips.

Figure 3.13: Interval estimates of the probability of heads based on 100 sets of 10,000 fair coin flips. Each dot represents the proportion of heads in a set of 10,000 fair coin flips. (The sets have been sorted based on their proportion of heads.) For each set an interval is obtained by adding/subtracting the margin of error of 0.01 from the proportion of heads. In 98 of these 100 sets (the blue dots/intervals) the corresponding interval contains the true probability of heads (0.5, represented by the vertical black line).

What if we want to be stricter about what qualifies as “close to 0.5”? That is, what if a margin of error of 0.01 isn’t good enough? You might suspect that with even more flips we would expect to observe heads on even closer to 50% of flips. Indeed, this is the case. Figure 3.14 displays the results of 100 sets of 1,000,000 fair coin flips. The pattern seems similar to Figure 3.12, but pay close attention to the horizontal axis which covers a much shorter range of values than in the previous figure. Now 91 of the 100 proportions are between 0.499 and 0.501. So in 1,000,000 flips we would expect the proportion of heads to be between 0.499 and 0.501, pretty close to 0.5. This suggests that 0.001 might be an appropriate margin of error for a simulation based on 1,000,000 flips.

Figure 3.14: Proportion of flips which are heads in 100 sets of 1,000,000 fair coin flips. Each dot represents a set of 1,000,000 fair coin flips. In 91 of these 100 sets the proportion of heads is between 0.499 and 0.501 (the blue dots).

What about even more flips? In Figure 3.15 each dot represents a set of 100 million flips. The pattern seems similar to the previous figures, but again pay close attention to the horizontal axis which covers a smaller range of values. Now 96 of the 100 proportions are between 0.4999 and 0.5001. So in 100 million flips we would expect the proportion of heads to be between 0.4999 and 0.5001, pretty close to 0.5. This suggests that 0.0001 might be an appropriate margin of error for a simulation based on 100,000,000 flips.

Figure 3.15: Proportion of flips which are heads in 100 sets of 100,000,000 fair coin flips. Each dot represents a set of 100,000,000 fair coin flips. In 96 of these 100 sets the proportion of heads is between 0.4999 and 0.5001 (the blue dots).

The previous figures illustrate that the more flips there are, the more likely it is that we observe a proportion of flips landing on heads close to 0.5. We also see that with more flips we can refine our definition of “close to 0.5”: increasing the number of flips by a factor of 100 (10,000 to 1,000,000 to 100,000,000) seems to give us an additional decimal place of precision (\(0.5\pm0.01\) to \(0.5\pm 0.001\) to \(0.5\pm 0.0001\).)
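The \(\sqrt{N}\) scaling can be checked directly with a small meta-simulation in plain Python. This sketch uses far fewer flips than the figures above (100 versus 10,000 flips per set) to keep the run time modest; the spread of the proportions across sets should shrink by roughly \(\sqrt{100}=10\) when the number of flips per set grows by a factor of 100.

```python
import random
import statistics

random.seed(6)  # for reproducibility of this sketch

def phat_spread(N, sets=100):
    """Standard deviation of the proportion of heads across `sets`
    simulations of N fair coin flips each."""
    props = [sum(random.random() < 0.5 for _ in range(N)) / N
             for _ in range(sets)]
    return statistics.stdev(props)

sd_small = phat_spread(100)     # N = 100 flips per set
sd_large = phat_spread(10_000)  # N = 10,000 flips per set (100x more)

# The ratio of the spreads should be roughly sqrt(100) = 10
print(sd_small, sd_large, sd_small / sd_large)
```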

3.6.1 A closer look at margin of error

We will now carry out an analysis similar to the one above to investigate simulation margin of error and how it is influenced by the number of simulated values used to compute the relative frequency. Continuing the dice example, suppose we want to estimate \(p=\textrm{P}(X=6)\), the probability that the sum of two rolls of a fair four-sided die equals six. Since there are 16 possible equally likely outcomes, 3 of which result in a sum of six, the true probability is \(p=3/16=0.1875\).

We will perform a “meta-simulation”. The process is as follows.

  1. Simulate two rolls of a fair four-sided die. Compute the sum (\(X\)) and see if it is equal to 6.
  2. Repeat step 1 to generate \(N\) simulated values of the sum (\(X\)). Compute the relative frequency of sixes: count the number of the \(N\) simulated values equal to 6 and divide by \(N\). Denote this relative frequency \(\hat{p}\) (read “p-hat”).
  3. Repeat step 2 a large number of times, recording the relative frequency \(\hat{p}\) for each set of \(N\) values.

Be sure to distinguish between steps 2 and 3. A simulation will typically involve just steps 1 and 2, resulting in a single relative frequency based on \(N\) simulated values. Step 3 is the “meta” step; we see how this relative frequency varies from simulation to simulation to help us in determining an appropriate margin of error. The important quantity in this analysis is \(N\), the number of independently simulated values used to compute the relative frequency in a single simulation. We wish to see how \(N\) impacts margin of error. The number of simulations in step 3 just needs to be “large” enough to provide a clear picture of how the relative frequency varies from simulation to simulation. The more the relative frequency varies from simulation to simulation, the larger the margin of error needs to be.

We can combine steps 1 and 2 of the meta-simulation to put it in the framework of the simulations from earlier in this chapter. Namely, we can code the meta-simulation as a single simulation in which

  • A sample space outcome represents \(N\) values of the sum of two fair four-sided dice
  • The main random variable of interest is the proportion of the \(N\) values which are equal to 6.

Let’s first consider \(N=100\). The following Symbulate code defines the probability space corresponding to 100 values of the sum of two fair four-sided dice. Notice the use of apply which functions much in the same way as RV.

N = 100
P = BoxModel([1, 2, 3, 4], size = 2).apply(sum) ** N
P.sim(5)
Index  Result
0      (5, 7, 7, 4, 6, ..., 4)
1      (4, 4, 5, 5, 5, ..., 6)
2      (5, 5, 5, 6, 5, ..., 5)
3      (6, 5, 4, 6, 5, ..., 6)
4      (5, 3, 6, 6, 2, ..., 6)

In the code above

  • BoxModel([1, 2, 3, 4], size = 2) simulates two rolls of a fair four-sided die
  • .apply(sum) computes the sum of the two rolls
  • ** N repeats the process N times to generate a set of N independent values, each value representing the sum of two rolls of a fair four-sided die
  • P.sim(5) simulates 5 sets, each set consisting of N sums

Now we define the random variable which takes as an input a set of \(N\) sums and returns the proportion of the \(N\) sums which are equal to six.

phat = RV(P, count_eq(6)) / N
phat.sim(5)
Index  Result
0      0.17
1      0.17
2      0.15
3      0.18
4      0.25

In the code above

  • phat is an RV defined on the probability space P. Recall that an outcome of P is a set of N sums (and each sum is the sum of two rolls of a fair four-sided die).
  • The function that defines the RV is count_eq(6), which counts the number of values in the set that are equal to 6. We then divide by N, the total number of values in the set, to get the relative frequency. (Remember that a transformation of a random variable is also a random variable.)
  • phat.sim(5) generates 5 simulated values of the relative frequency phat. Each simulated value of phat is the relative frequency of sixes in N sums of two rolls of a fair four-sided die.

Now we simulate and summarize a large number of values of phat. We’ll simulate 100 values for illustration (as we did in Figure 3.12, Figure 3.14, and Figure 3.15). Be sure not to confuse 100 with N. Remember, the important quantity is N, the number of simulated values used in computing each relative frequency.

phat.sim(100).plot(type = "impulse", normalize = False)
plt.ylabel('Number of simulations');

We see that the 100 relative frequencies are roughly centered around the true probability 0.1875, but there is variability in the relative frequencies from simulation to simulation. From the range of values, we see that most relative frequencies are within about 0.08 of the true probability 0.1875. So a value around 0.08 seems like a reasonable value of the margin of error, but the actual value depends on what we mean by “most”. We can get a clearer picture if we run more simulations. The following plot displays the results of 10000 simulations, each resulting in a value of \(\hat{p}\). Remember that each relative frequency is based on \(N=100\) sums of two rolls.

phats = phat.sim(10000)
phats.plot(type = "impulse", normalize = False)
plt.ylabel('Number of simulations');

Let’s see how many of these 10000 simulated proportions are within 0.08 of the true probability 0.1875.

1 - (phats.count_lt(0.1875 - 0.08) + phats.count_gt(0.1875 + 0.08)) / 10000
0.9609

In roughly 95% or so of simulations, the simulated relative frequency was within 0.08 of the true probability. So 0.08 seems like a reasonable margin of error for “95% confidence” or “95% accuracy”. However, a margin of error of 0.08 yields pretty imprecise estimates, ranging from about 0.10 to 0.27. Can we keep the degree of accuracy at 95% but get a smaller margin of error, and hence a more precise estimate? Yes, if we increase the number of repetitions used to compute the relative frequency.

Now we repeat the analysis, but with \(N=10000\). In this case, each relative frequency is computed based on 10000 independent values, each value representing a sum of two rolls of a fair four-sided die. As before we start with 100 simulated relative frequencies.

N = 10000
P = BoxModel([1, 2, 3, 4], size = 2).apply(sum) ** N
phat = RV(P, count_eq(6)) / N

phats = phat.sim(100)
phats.plot(type = "impulse", normalize = False)
plt.ylabel('Number of simulations');

Again we see that the 100 relative frequencies are roughly centered around the true probability 0.1875, but there is less variability in the relative frequencies from simulation to simulation for \(N=10000\) than for \(N=100\). Pay close attention to the horizontal axis. From the range of values, we see that most relative frequencies are now within about 0.008 of the true probability 0.1875.

1 - (phats.count_lt(0.1875 - 0.008) + phats.count_gt(0.1875 + 0.008)) / 100
0.95

As with \(N=100\), running more than 100 simulations would give a clearer picture of how much the relative frequency based on \(N=10000\) simulated values varies from simulation to simulation. But even with just 100 simulations, we see that a margin of error of about 0.008 is required for roughly 95% accuracy when \(N=10000\), as opposed to 0.08 when \(N=100\). As we observed in the coin flipping example earlier in this section, it appears that increasing \(N\) by a factor of 100 yields an extra decimal place of precision. That is, increasing \(N\) by a factor of 100 decreases the margin of error by a factor of \(\sqrt{100}\). In general, the margin of error is inversely related to \(\sqrt{N}\).

Table 3.9: Comparison of margins of error for 95% confidence for the meta-simulations in this section.
Probability that                              True value  95% m.o.e. (N = 100)  95% m.o.e. (N = 10000)  95% m.o.e. (N = 1000000)
A fair coin flip lands H                      0.5000      0.1000                0.0100                  0.0010
Two rolls of a fair four-sided die sum to 6   0.1875      0.0781                0.0078                  0.0008

The two examples in this section illustrate that the margin of error also depends somewhat on the true probability. The margin of error required for 95% accuracy is larger when the true probability is 0.5 than when it is 0.1875. It can be shown that when estimating a probability \(p\) with a relative frequency based on \(N\) simulated repetitions, the margin of error required for 95% confidence is \[ 2\frac{\sqrt{p(1-p)}}{\sqrt{N}} \] For a given \(N\), the above quantity is maximized when \(p\) is 0.5. Since \(p\) is usually unknown—the reason for performing the simulation is to approximate it—we plug in 0.5 for a somewhat conservative margin of error of \(1/\sqrt{N}\).
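The formula is easy to check against the margins of error in Table 3.9:

```python
import math

def moe95(p, N):
    """95% margin of error, 2 * sqrt(p(1-p)) / sqrt(N), for a relative
    frequency based on N independently simulated repetitions."""
    return 2 * math.sqrt(p * (1 - p)) / math.sqrt(N)

# Reproduce the rows of Table 3.9: p = 0.5 (coin) and p = 0.1875 (dice)
for p in (0.5, 0.1875):
    print(p, [round(moe95(p, N), 4) for N in (100, 10_000, 1_000_000)])
```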

For a fixed \(N\), there is a tradeoff between accuracy and precision. The factor 2 in the margin of error formula above corresponds to 95% accuracy. Greater accuracy would require a larger factor, and a larger margin of error, resulting in a wider—that is, less precise—interval. For example, 99% confidence requires a factor of roughly 2.6 instead of 2, resulting in an interval that is roughly 30 percent wider. The confidence level does matter, but the primary influencer of margin of error is \(N\), the number of independently simulated repetitions on which the relative frequency is based. Regardless of confidence level, the margin of error is on the order of magnitude of \(1/\sqrt{N}\).

In summary, the margin of error when approximating a probability based on a simulated relative frequency is roughly on the order \(1/\sqrt{N}\), where \(N\) is the number of independently simulated repetitions used to calculate the relative frequency. Warning: alternative methods are necessary when the actual probability being estimated is very close to 0 or to 1.

A probability is a theoretical long run relative frequency. A probability can be approximated by a relative frequency from a large number of simulated repetitions, but there is some simulation margin of error. Likewise, the average value of \(X\) after a large number of simulated repetitions is only an approximation to the theoretical long run average value of \(X\), that is, the expected value of \(X\). The margin of error is also on the order of \(1/\sqrt{N}\) where \(N\) is the number of independently simulated values used to compute the average, but the degree of variability of the random variable also plays a role. We will explore margins of error for long run averages in more detail later.

Pay attention to what \(N\) denotes: \(N\) is the number of independently simulated repetitions used to calculate the relative frequency. This is not necessarily the number of simulated repetitions. For example, suppose we use simulation to approximate the conditional probability that the larger of two rolls of a fair four-sided die is 4 given that the sum is equal to 6. We might start by simulating 10000 pairs of rolls. But the sum would be equal to 6 in only about 1875 pairs, and it is only these pairs that would be used to compute the relative frequency that the larger roll is 4 to approximate the probability of interest. The appropriate margin of error is roughly \(1/\sqrt{1875} \approx 0.023\). Compared to 0.01 (based on the original 10000 repetitions), the margin of error of 0.023 for the conditional probability results in intervals that are 130 percent wider. Carefully identifying the number of repetitions used to calculate the relative frequency is especially important when determining appropriate simulation margins of error for approximating conditional probabilities, which we’ll discuss in more detail later.
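This dice example can be sketched in plain Python (standard library only, not Symbulate). Note that the margin of error is computed from the number of pairs satisfying the condition, not from the 10,000 simulated pairs. (For reference, the true conditional probability is \(2/3\), since two of the three equally likely pairs summing to 6—(2, 4), (3, 3), (4, 2)—have a maximum of 4.)

```python
import math
import random

random.seed(7)  # for reproducibility of this sketch

# Simulate 10,000 pairs of rolls of a fair four-sided die
pairs = [(random.randint(1, 4), random.randint(1, 4)) for _ in range(10_000)]

# Only the pairs whose sum equals 6 enter the conditional relative frequency
given_six = [max(a, b) for a, b in pairs if a + b == 6]
n_given = len(given_six)  # roughly 0.1875 * 10,000 = 1875 pairs

# Conditional relative frequency of {larger roll is 4} given {sum is 6}
p_cond = sum(1 for m in given_six if m == 4) / n_given

# The margin of error uses n_given, not 10,000
moe = 1 / math.sqrt(n_given)
print(n_given, p_cond, moe)
```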

3.6.2 Approximating multiple probabilities

When using simulation to estimate a single probability, the primary influencer of margin of error is \(N\), the number of repetitions on which the relative frequency is based. It doesn’t matter as much whether we use, say, 95% versus 99% confidence (either way, it’s an “A”). That is, it doesn’t matter too much whether we compute our margin of error using \[ 2\frac{\sqrt{p(1-p)}}{\sqrt{N}}, \] with a multiple of 2 for 95% confidence, or if we replace 2 by 2.6 for 99% confidence. (Remember, we can plug in 0.5 for the unknown \(p\) for a conservative margin of error.) A margin of error based on 95% or 99% (or another confidence level in the neighborhood) provides a reasonably accurate estimate of the probability. However, using simulation to approximate multiple probabilities simultaneously requires a little more care with the confidence level.

In the previous section we used simulation to estimate \(\textrm{P}(X=6)\). Now suppose we want to approximate \(\textrm{P}(X = x)\) for each value of \(x = 2,3,4,5,6,7,8\). We could run a simulation to obtain results like those in Figure 3.11 and the table before it. Each of the relative frequencies in the table is an approximation of the true probability, and so each of the relative frequencies should have a margin of error, say 0.01 for a simulation based on 10000 repetitions. Thus, the simulation results yield a collection of seven interval estimates, an interval estimate of \(\textrm{P}(X = x)\) for each value of \(x = 2,3,4,5,6,7,8\). Each interval in the collection either contains the respective true probability or not. The question is then: In what percent of simulations will every interval in the collection contain the respective true probability?

Figure 3.16 summarizes the results of 100 simulations. Each simulation consists of 10000 repetitions, with results similar to those in Figure 3.11 and the table before it. Each simulation is represented by a row in Figure 3.16, consisting of seven 95% interval estimates, one for each value of \(x\). (The rows have been sorted before plotting; see the next paragraph.) Each panel represents a different value of \(x\); for each value of \(x\), around 95 out of the 100 simulations yield estimates that contain the true probability \(\textrm{P}(X=x)\), represented by the vertical line.

Figure 3.16: Results of 100 simulations. Each simulation yields a collection of seven 95% confidence intervals.

However, let’s zoom in on the bottom of Figure 3.16. Figure 3.17 displays the results of the 21 simulations at the bottom of Figure 3.16. Look carefully row by row in Figure 3.17; in each of these simulations at least one of the seven intervals in the collection does not contain the true probability. In other words, every interval in the collection contains the respective true probability in only 79 of the 100 simulations (the remaining simulations in Figure 3.16). While we have 95 percent confidence in our interval estimate of \(\textrm{P}(X = x)\) for any single \(x\), we only have around 79 percent confidence in our approximate distribution of \(X\). Our confidence “grade” has gone from “A” range (95 percent) to “C” range (79 percent).

Figure 3.17: Subset of 21 simulations from Figure 3.16. In each of these simulations, at least one 95% confidence interval does not contain the respective true probability.

When approximating multiple probabilities based on a single simulation, such as when approximating a distribution, margins of error and interval estimates need to be adjusted to obtain simultaneous 95% confidence. The easiest way to do this is to make all of the intervals in the collection wider.

There are many procedures for adjusting a collection of interval estimates to achieve simultaneous confidence; we won’t get into any technical details. As a very rough, but simple and typically conservative rule, when approximating many probabilities based on a single simulation, we recommend making the margin of error twice as large as when approximating a single probability. That is, use a margin of error of \(2/\sqrt{N}\) (rather than \(1/\sqrt{N}\)) to achieve simultaneous 95% confidence when approximating many probabilities based on a single simulation.
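As a sketch of this rule of thumb in code (the helper name `interval` is ours, and the relative frequencies below are hypothetical values for illustration, not actual simulation output):

```python
from math import sqrt

def interval(rel_freq, N, simultaneous=False):
    """Conservative 95% interval estimate for a simulated probability.

    simultaneous=True doubles the margin of error (2/sqrt(N) rather than
    1/sqrt(N)) to achieve roughly 95% *simultaneous* confidence when many
    probabilities are approximated from a single simulation.
    """
    moe = (2 if simultaneous else 1) / sqrt(N)
    return (rel_freq - moe, rel_freq + moe)

# Hypothetical relative frequencies for x = 2, ..., 8 from one simulation
# of 10000 repetitions (for illustration only)
rel_freqs = {2: 0.0605, 3: 0.1267, 4: 0.1854, 5: 0.2481,
             6: 0.1892, 7: 0.1270, 8: 0.0631}
for x, rf in rel_freqs.items():
    print(x, interval(rf, 10000, simultaneous=True))
```

Each adjusted interval is twice as wide as the corresponding single-probability interval, which is what buys the simultaneous coverage.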

Figure 3.18 displays the same simulation results as Figure 3.16, but now the margin of error for each confidence interval is twice as large. We can see that all of the original “bad” intervals turn “good” when widened, and now there is (greater than) 95% simultaneous confidence.

Figure 3.18: Results of 100 simulations. Each simulation yields a collection of seven 95% confidence intervals. The margin of error has been adjusted to obtain simultaneous confidence.

3.6.3 Beware a false sense of precision

Why don’t we always run something like one trillion repetitions so that our margin of error is tiny? There is a cost to simulating and storing more repetitions in terms of computational time and memory. Also, remember that simulating one trillion repetitions doesn’t guarantee that the margin of error is actually based on anywhere close to one trillion repetitions, especially when conditioning on a low probability event.

Most importantly, keep in mind that any probability model is based on a series of assumptions and these assumptions are not satisfied exactly by the random phenomenon. A precise estimate of a probability under the assumptions of the model is not necessarily a comparably precise estimate of the true probability. Reporting probability estimates out to many decimal places conveys a false sense of precision and should typically be avoided.

For example, the probability that any particular coin lands on heads is probably not 0.5 exactly. But any difference between the true probability of the coin landing on heads and 0.5 is likely not large enough to be practically meaningful30. That is, assuming the coin is fair is a reasonable model.

Suppose we assume that the probability that a coin lands on heads is exactly 0.5 and that the results of different flips are independent. If we flip the coin 1000 times the probability that it lands on heads at most 490 times is 0.2739864. (We will see a formula for computing this value later.) If we were to simulate one trillion repetitions (each consisting of 1000 flips) to estimate this probability then our margin of error would be 0.000001; we could expect accuracy out to the sixth decimal place. However, reporting the probability with so many decimal places is somewhat disingenuous. If the probability that the coin lands on heads were 0.50001, then the probability of at most 490 heads in 1000 flips would be 0.2737757. If the probability that the coin lands on heads were 0.5001, then the probability of at most 490 heads in 1000 flips would be 0.2718834. Reporting our approximate probability as something like 0.2739864 \(\pm\) 0.000001 says more about the precision in our assumption that the coin is fair than it does about the true probability that the coin lands on heads at most 490 times in 1000 flips. A more honest conclusion would result from running 10000 repetitions and reporting our approximate probability as something like 0.27 \(\pm\) 0.01. Such a conclusion reflects more genuinely that there’s some “wiggle room” in our assumptions, and that any probability computed according to our model is at best a reasonable approximation of the “true” probability.
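A simulation like the one described above can be sketched in plain Python (the function name and repetition count are our choices; under the model’s assumptions the estimate should land within about 0.01 of the value 0.274 quoted above):

```python
import random

random.seed(1)  # fix the seed so the sketch is reproducible

def estimate_at_most(num_heads, flips=1000, reps=10000, p=0.5):
    """Estimate P(at most num_heads heads in `flips` flips) by simulating
    `reps` repetitions, each consisting of `flips` independent coin flips."""
    count = 0
    for _ in range(reps):
        heads = sum(random.random() < p for _ in range(flips))
        if heads <= num_heads:
            count += 1
    return count / reps

est = estimate_at_most(490)
print(round(est, 2), "+/-", 0.01)  # something like 0.27 +/- 0.01
```

Reporting the result as roughly 0.27 \(\pm\) 0.01 matches the honest level of precision discussed above.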

For most of the situations we’ll encounter in this book, estimating a probability to within 0.01 of its true value will be sufficient for practical purposes, and so basing approximations on 10000 independently simulated values will be appropriate. Of course, there are real situations where probabilities need to be estimated much more precisely, e.g., the probability that a bridge will collapse. Such situations require more intensive methods.

Example 3.9 Donny Don’t runs a simulation to approximate the probability of an event \(A\). In 1,000,000 repetitions, event \(A\) occurs on 437,952 repetitions. He computes a margin of error (for 95% confidence) of \(1/\sqrt{1000000} = 0.001\). Donny tries a few different ways of reporting his results. Explain to him what is wrong, missing, or could be improved in each of the following.

  1. I estimate that \(\textrm{P}(A)\) is 0.437952.
  2. I estimate that \(\textrm{P}(A)\) is 43.7952% with a margin of error of 0.1%.
  3. I estimate that \(\textrm{P}(A)\) is between 0.436952 and 0.438952.
  4. I estimate with 95% confidence that under these assumptions \(\textrm{P}(A)\) is between 0.437 and 0.439.

Solution 3.9.

  1. Donny has not provided a margin of error.
  2. We’re okay with Donny expressing the proportion as a percent. However, 0.001 is technically a margin of error of 0.1 percentage points. A margin of error of 0.1% suggests a relative change of 0.1 percent of the estimate, leading to interval endpoints of \(0.437952\times (1 - 0.001) = 0.437514\) and \(0.437952\times (1 + 0.001) = 0.438390\). With a margin of error of 0.001, the interval endpoints are really \(0.437952 - 0.001 = 0.436952\) and \(0.437952 + 0.001 = 0.438952\). The numbers are fairly similar in this case, but in general, the margins of error we have discussed are additive and therefore should be reported as percentage points.
  3. We don’t know the context, but in many situations reporting estimates to six decimal places as Donny has done is not necessary and could convey a false sense of precision.
  4. We like Donny’s statement! He has provided an interval estimate, reported his level of confidence, rounded to an appropriate number of decimal places (in most contexts) so as not to convey a false sense of precision, and drawn attention to the fact that the estimate of the probability is predicated on the model assumptions. It would have also been fine if he had replaced “is between 0.437 and 0.439” with “is 0.438 with a margin of error of 0.001”. In practice, Donny should replace vague statements like “these assumptions” and “\(\textrm{P}(A)\)” with contextual details, but the general form of his answer is very good.

3.6.4 Exercises

Exercise 3.16 Use simulation to approximate the probability that at least two people in a group of \(n=30\) share a birthday in Example 2.48. Compute the margin of error and write a clearly worded sentence reporting your approximate probability in context.

3.7 Approximating expected values: Averages

On any single repetition of a simulation a particular event either occurs or not. Summarizing simulation results for events involves simply counting the number of repetitions on which the event occurs and finding related proportions.

On the other hand, random variables typically take many possible values over the course of many repetitions. We are still interested in relative frequencies of events, like \(\{X=6\}\), \(\{Y \ge 3\}\), and \(\{X > 5, Y \ge 3\}\). But for random variables we are also interested in their distributions which describe the possible values that the random variables can take and their relative likelihoods. A marginal distribution contains all the information about the pattern of variability of a single random variable alone, and a joint distribution contains all the information about the pattern of variability of a collection of random variables. It is also useful to summarize some key features of the pattern of variability.

One summary characteristic of a marginal distribution of a random variable is the expected value. In Section 1.7 and Section 2.5.3 we discussed how an expected value can be interpreted as the long run average value of a random variable. We can approximate the long run average value by simulating many values of the random variable and computing the average (a.k.a. mean) in the usual way: sum all the simulated values and divide by the number of simulated values.

Example 3.10 Let \(X\) be the sum of two rolls of a fair four-sided die, and let \(Y\) be the larger of the two rolls (or the common value if a tie). Recall your tactile simulation from Example 3.1; see ours in Table 3.1. Based only on the results of your simulation, approximate the long run average value of each of the following. (Don’t worry if the approximations are any good yet.)

  1. \(X\)
  2. \(Y\)
  3. \(X^2\)
  4. \(XY\)

Solution 3.10. We reproduce the results of our simulation in Table 3.10 with additional columns for \(X^2\) and \(XY\). Results vary naturally so your simulation results will be different, but the same ideas apply.

Table 3.10: Results of 10 repetitions of two rolls of a fair four-sided die
Results of 10 repetitions of two rolls of a fair four-sided die
Repetition First roll Second roll X Y X^2 XY
1 2 1 3 2 9 6
2 1 1 2 1 4 2
3 3 3 6 3 36 18
4 4 3 7 4 49 28
5 3 2 5 3 25 15
6 3 4 7 4 49 28
7 2 3 5 3 25 15
8 2 4 6 4 36 24
9 1 2 3 2 9 6
10 3 4 7 4 49 28
  1. Approximate the long run average value of \(X\) by summing the 10 simulated values of \(X\) and dividing by 10. \[ \frac{3 + 2 + 6 + 7 + 5 + 7 + 5 + 6 + 3 + 7}{10} = 5.1 \]
  2. Approximate the long run average value of \(Y\) by summing the 10 simulated values of \(Y\) and dividing by 10. \[ \frac{2 + 1 + 3 + 4 + 3 + 4 + 3 + 4 + 2 + 4}{10} = 3 \]
  3. First, for each repetition square the value of \(X\) to obtain the \(X^2\) column. Then approximate the long run average value of \(X^2\) by summing the 10 simulated values of \(X^2\) and dividing by 10. \[ \frac{9 + 4 + 36 + 49 + 25 + 49 + 25 + 36 + 9 + 49}{10} = 29.1 \]
  4. First, for each repetition compute the product \(XY\) to obtain the \(XY\) column. Then approximate the long run average value of \(XY\) by summing the 10 simulated values of \(XY\) and dividing by 10. \[ \frac{6 + 2 + 18 + 28 + 15 + 28 + 15 + 24 + 6 + 28}{10} = 17 \]

Of course, 10 repetitions is not enough to reliably approximate the long run average value. But whether the average is based on 10 values or 10 million, an average is computed in the usual way: sum the values and divide by the number of values.
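The averages in Example 3.10 can be reproduced with a few lines of plain Python (a sketch using the values from Table 3.10; the helper name `average` is ours):

```python
# Simulated values of X and Y from Table 3.10
x_values = [3, 2, 6, 7, 5, 7, 5, 6, 3, 7]
y_values = [2, 1, 3, 4, 3, 4, 3, 4, 2, 4]

def average(values):
    """Average in the usual way: sum the values, divide by how many."""
    return sum(values) / len(values)

print(average(x_values))                               # 5.1
print(average(y_values))                               # 3.0
print(average([x ** 2 for x in x_values]))             # 29.1
print(average([x * y for x, y in zip(x_values, y_values)]))  # 17.0
```

Note that the \(X^2\) and \(XY\) columns are built by transforming each repetition’s values first, and only then averaging.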

Example 3.11 Recall Section 3.2.2 where we assumed the arrival time \(X\) (minutes after noon) followed a Normal(30, 10) distribution, represented by the spinner in Figure 3.7. Table 3.5 displays 10 simulated values of \(X\). Based only on the results of this simulation, approximate the long run average value of \(X\). (Don’t worry if the approximation is any good yet.)

Solution 3.11. Approximate the long run average value of \(X\) by simply computing the average in the usual way: sum the 10 simulated values of \(X\) and divide by 10. It’s just that when \(X\) is continuous each simulated value will be distinct with lots of decimal places.

\[ {\scriptscriptstyle \frac{21.387... + 50.719... + 24.09... + 32.68... + 54.511... + 38.117... + 16.819... + 22.98... + 28.867... + 43.43}{10} = 33.36 } \]

We have seen a few examples of how to compute expected values for discrete random variables as probability-weighted average values. The computation of expected values for continuous random variables is more complicated. However, expected values can also be interpreted as long run average values for continuous random variables. Therefore, expected values can be approximated in the same way for discrete and continuous random variables: simulate lots of values and compute the average.

In the examples in this section we only considered 10 simulated repetitions to emphasize that we’re simply computing an average in the usual way. But to get a good approximation of an expected value, we’ll need to simulate many more values and average them. We’ll soon explore what happens to the average as we simulate more values, but first a caution about averages of transformations.

3.7.1 Averages of transformations

Example 3.12 Donny Don’t says: “In Example 3.10, why bother creating columns for \(X^2\) and \(XY\)? If I want to find the average value of \(X^2\) I can just square the average value of \(X\). For the average value of \(XY\) I can just multiply the average value of \(X\) and the average value of \(Y\).” Do you agree? (Check to see if this works for your simulation results.) If not, explain why not.

Solution 3.12. It is easy to check that Donny has made a mistake just by inspecting the simulation results: \(5.1^2 = 26.01 \neq 29.1\), and \(5.1 \times 3 = 15.3 \neq 17\). To see why, suppose we had just performed two repetitions, resulting in the first two rows of Table 3.10.

\[ \text{Average of $X^2$} = \frac{3^2 + 2^2}{2} =6.5 \neq 6.25= \left(\frac{3 + 2}{2}\right)^2=(\text{Average of $X$})^2 \]

Squaring first and then averaging (which yields 6.5) is not the same as averaging first and then squaring (which yields 6.25), essentially because \((3+2)^2\neq 3^2 + 2^2\).

Similarly, \[ {\small \text{Average of $XY$} = \frac{(3)(2) + (2)(1)}{2} =4 \neq 3.75= \left(\frac{3 + 2}{2}\right)\left(\frac{2 + 1}{2}\right)=(\text{Average of $X$})\times (\text{Average of $Y$}) } \] Multiplying first and then averaging (which yields 4) is not the same as averaging first and then multiplying (which yields 3.75), essentially because \((3)(2)+(2)(1)\neq(3+2)(2+1)\).

We often transform random variables to obtain other random variables, e.g. \(X\) to \(g(X)\). To obtain the average of the transformed variable, we cannot simply plug the average of the original variable into the transformation formula. Instead, we need to transform the values of the variable first and then average the transformed values. In general the order of transforming and averaging is not interchangeable. Whether in the short run or the long run, in general \[\begin{align*} \text{Average of $g(X)$} & \neq g(\text{Average of $X$})\\ \text{Average of $g(X, Y)$} & \neq g(\text{Average of $X$}, \text{Average of $Y$}) \end{align*}\]

In terms of expected value, in general \[\begin{align*} \textrm{E}(g(X)) & \neq g(\textrm{E}(X))\\ \textrm{E}(g(X, Y)) & \neq g(\textrm{E}(X), \textrm{E}(Y)) \end{align*}\]

Warning

Be careful! Many common mistakes in probability result from not heeding this general principle about averages (expected values) of transformations.
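The same principle holds for exact probability-weighted averages, not just simulated ones. A minimal check for a single roll of a fair four-sided die (the variable names are ours):

```python
# A single roll of a fair four-sided die: values 1-4, each with probability 1/4
values = [1, 2, 3, 4]

ev = sum(values) / len(values)                      # E(X) = 2.5
ev_sq = sum(v ** 2 for v in values) / len(values)   # E(X^2) = 7.5

print(ev_sq)    # 7.5
print(ev ** 2)  # 6.25: E(X^2) is not E(X)^2
```

Squaring first and then averaging gives 7.5; averaging first and then squaring gives 6.25.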

3.7.2 Long run averages

We have investigated long run relative frequencies; see Section 1.2.1 and Section 3.6 in particular. Now let’s see what happens to averages in the long run. Continuing Example 3.10 let \(X\) be the sum of two rolls of a fair four-sided die. Table 3.11 displays the results of 10 pairs of rolls of a fair four-sided die. The first column is the repetition number (first pair, second pair, and so on) and the second column represents \(X\), the sum of the two rolls. The third column displays the running sum of \(X\) values, and the fourth column the running average of \(X\) values. Figure 3.19 displays the running average as a function of the number of repetitions. Of course, the results depend on the particular sequence of rolls. We encourage you to roll the dice and construct your own table and plot.

Table 3.11: Results and running average of \(X\), the sum of two rolls of a fair four-sided die.
Repetition Value of X Running sum of X Running average of X
1 3 3 3.000
2 2 5 2.500
3 6 11 3.667
4 7 18 4.500
5 5 23 4.600
6 7 30 5.000
7 5 35 5.000
8 6 41 5.125
9 3 44 4.889
10 7 51 5.100
Figure 3.19: Running average of \(X\) for the 10 pairs of rolls in Table 3.11.
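The running sums and running averages in Table 3.11 can be computed as follows (a sketch using the table’s simulated values of \(X\)):

```python
from itertools import accumulate

# Values of X from Table 3.11
x_values = [3, 2, 6, 7, 5, 7, 5, 6, 3, 7]

running_sums = list(accumulate(x_values))
# Running average after n repetitions = running sum / n
running_avgs = [s / n for n, s in enumerate(running_sums, start=1)]

for n, (s, a) in enumerate(zip(running_sums, running_avgs), start=1):
    print(n, s, round(a, 3))
```

The last line of output reproduces the bottom row of Table 3.11: after 10 repetitions the running sum is 51 and the running average is 5.1.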

Now we’ll perform 90 more repetitions for a total of 100. Figure 3.20 (a) summarizes the results, while Figure 3.20 (b) also displays the results for 3 additional simulations of 100 pairs of rolls. The running average fluctuates considerably in the early stages, but settles down and tends to get closer to 5 as the number of repetitions increases. However, each of the four simulations results in a different average of \(X\) after 100 pairs of rolls: 4.85 (gray), 4.9 (orange), 5.25 (blue), 4.79 (green). Even after 100 pairs of rolls the running average of \(X\) still fluctuates from simulation to simulation.

(a) A single simulation of 100 pairs of rolls
(b) Four simulations, each of 100 pairs of rolls
Figure 3.20: Running average of \(X\), the sum of two rolls of a fair four-sided die, for four simulations of 100 pairs of rolls.

Now we’ll add 900 more repetitions for a total of 1000 in each simulation. Figure 3.21 (a) summarizes the results for our original set and Figure 3.21 (b) also displays the results for three additional simulations. Again, the running average of \(X\) fluctuates considerably in the early stages, but settles down and tends to get closer to 5 as the number of repetitions increases. Compared to the results after 100 repetitions, there is less variability between simulations in the running average of \(X\) after 1000 repetitions: 4.99 (gray), 4.964 (orange), 5.006 (blue), 4.933 (green). Now, even after 1000 repetitions the running average of \(X\) isn’t guaranteed to be exactly 5, but we see a tendency for the running average of \(X\) to get closer to 5 as the number of repetitions increases.

(a) A single simulation of 1000 pairs of rolls
(b) Four simulations, each of 1000 pairs of rolls
Figure 3.21: Running average of \(X\), the sum of two rolls of a fair four-sided die, for four simulations of 1000 pairs of rolls.

The marginal distribution of \(X\), depicted in Figure 2.16, shows that 5 is the “balance point” of the distribution. In Example 2.44 we saw that 5 is the probability-weighted average value of \(X\), and we interpreted 5 as the value of \(X\) we would “expect” to see on average in the long run. In this sense, 5 is the “true” long run average value of \(X\), and our simulation results agree. We will discuss “true” long run average values, or “expected values”, in much more detail later. For now, we’ll rely on simulation: we can approximate the long run average value of a random variable \(X\) by simulating many values of \(X\) and finding the average in the usual way.
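The “true” value of 5 can also be verified by enumerating all equally likely outcomes of the two rolls (a sketch of the probability-weighted average calculation):

```python
from itertools import product

# All 16 equally likely outcomes of two rolls of a fair four-sided die;
# each outcome has probability 1/16
outcomes = list(product([1, 2, 3, 4], repeat=2))

# Probability-weighted average of X = sum of the two rolls
ev = sum((u1 + u2) / len(outcomes) for u1, u2 in outcomes)
print(ev)  # 5.0, the "true" long run average value of X
```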

Recall that a probability is a theoretical long run relative frequency. A probability can be approximated by a relative frequency from a large number of simulated repetitions, but there is some simulation margin of error.

Likewise, the average value of \(X\) after a large number of simulated repetitions is only an approximation to the theoretical long run average value of \(X\), and there is margin of error due to natural variability in the simulation. The margin of error is also on the order of \(1/\sqrt{N}\) where \(N\) is the number of independently simulated values used to compute the average. However, the degree of variability of the random variable itself also influences the margin of error when approximating long run averages. In particular, if \(\sigma\) is the standard deviation (see Section 3.8) of the random variable, then the margin of error for the average is on the order of \(\sigma / \sqrt{N}\).
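As a sketch of this margin of error in action, assuming the Normal(30, 10) arrival time model from earlier (we use the known \(\sigma = 10\) here for simplicity; in practice \(\sigma\) would itself be estimated from the simulated values):

```python
import random

random.seed(7)  # fix the seed so the sketch is reproducible

# Simulate N values of a Normal(30, 10) random variable and report the
# average with a rough 95% margin of error of 2 * sigma / sqrt(N)
N = 10000
values = [random.gauss(30, 10) for _ in range(N)]
average = sum(values) / N
margin = 2 * 10 / N ** 0.5   # 2 * sigma / sqrt(N) = 0.2

print(round(average, 1), "+/-", margin)
```

With 10000 repetitions the average should land within about 0.2 of the true long run average of 30.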

Remember that the long run average value is just one feature of a marginal distribution. There is much more to the long run pattern of variability of a random variable than just its average value. We are also interested in percentiles, degree of variability, and quantities that measure relationships between random variables. Two random variables can have the same long run average value but very different distributions. For example, the average temperature in both Phoenix, AZ and Miami, FL is around 75 degrees F, but the distribution of temperatures is not the same.

3.7.3 Averages in Symbulate

Continuing Example 3.10 let \(X\) be the sum of two rolls of a fair four-sided die. In Symbulate, we first simulate and store 10000 values of \(X\).

P = BoxModel([1, 2, 3, 4], size = 2)

X = RV(P, sum)

x = X.sim(10000)

We can approximate the long run average value of \(X\) by computing the average—a.k.a., mean—of the 10000 simulated values in the usual way: sum the 10000 simulated values stored in x and divide by 10000. Here are a few ways of computing the mean of the simulated values.

x.sum() / 10000
5.0158
x.sum() / x.count()
5.0158
x.mean()
5.0158

Averages are computed the same way, regardless of whether the variable is discrete or continuous. The following code computes the average of 5 then 10000 simulated values of \(X\), a random variable with the Normal(30, 10) distribution from Section 3.2.2.

X = RV(Normal(30, 10))

x = X.sim(5)

x
Index  Result
0      21.166077511220507
1      26.8591776741059
2      32.209870337674516
3      39.17367311435916
4      31.97525061804284
x.mean()
30.27680985108059
x = X.sim(10000)

x
Index  Result
0      43.62793154836135
1      36.943210494704836
2      28.789403266604605
3      23.566887514516907
4      21.926223254242437
5      24.693852628423915
6      46.08960228736697
7      37.225219096844036
8      38.70180176508012
...    ...
9999   32.57553162589599
x.mean()
30.16143846704617

Our simulation suggests that the long run average of a random variable with a Normal(30, 10) distribution is approximately 30. In general, one parameter of a Normal distribution represents its expected value, a.k.a., (long run) mean; the other parameter represents its standard deviation, which we will discuss soon (in Section 3.8).

3.7.4 Linearity of averages

Now we’ll introduce a useful property of averages.

Example 3.13 Recall your tactile simulation from Example 3.5. Let \(U_1\) be the result of the first roll, and \(U_2\) the result of the second, so the sum is \(X = U_1 + U_2\).

  1. Donny Don’t says: “\(X=U_1+U_2\), so I can find the average value of \(X\) by finding the average value of \(U_1\), the average value of \(U_2\), and adding the two averages”. Do you agree? Explain.
  2. Donny Don’t says: “\(U_1\) and \(U_2\) have the same distribution, so they have the same average value, so I can find the average value of \(X\) by multiplying the average value of \(U_1\) by 2”. Do you agree? Explain.
  3. Donny Don’t says: “\(U_1\) and \(U_2\) have the same distribution, so \(X=U_1+U_2\) has the same distribution as \(2U_1 = U_1 + U_1\)”. Do you agree? Explain.

Solution 3.13.

  1. Donny is correct! Our simulation results are in Table 3.10. The average value of \(U_1\) is \[ \frac{2 + 1 + 3 + 4 + 3 + 3 + 2 + 2 + 1 + 3}{10} = 2.4 \] The average value of \(U_2\) is \[ \frac{1 + 1 + 3 + 3 + 2 + 4 + 3 + 4 + 2 + 4}{10} = 2.7 \] The sum of these two values is equal to the average value of \(X\). To see why, suppose we had just performed two repetitions, resulting in the last two rows of Table 3.10. \[ {\scriptsize \text{Average of $(U_1+U_2)$} = \frac{(1 + 2) + (3 + 4)}{2} = 5 = 2 + 3= \left(\frac{1 + 3}{2}\right)+\left(\frac{2 + 4}{2}\right) = (\text{Average of $U_1$}) + (\text{Average of $U_2$}) } \] We discuss further below.
  2. Donny is correct that \(U_1\) and \(U_2\) have the same distribution, and he has some good ideas about averages. But we should remind Donny that a distribution represents the long run pattern of variability. With only 10 repetitions, the results for \(U_1\) will not necessarily follow the same pattern as those for \(U_2\). In our simulation, the average value of \(U_1\) is 2.4 and the average value of \(U_2\) is 2.7. Multiplying neither one of these numbers by 2 yields the average value of \(X\). Donny would have been correct if he were talking about long run average values. Since \(U_1\) and \(U_2\) have the same distribution, the long run average value of \(U_1\) is equal to the long run average value of \(U_2\), and so the long run average value of \(X\) is equal to the long run average value of \(U_1\) multiplied by two.
  3. Donny is not correct. In particular, \(X\) and \(2U_1\) do not have the same possible values; for example, \(X\) can be 3 but \(2U_1\) cannot. The long run average value is just one feature of a distribution. Just because \(X\) and \(2U_1\) have the same long run average value does not necessarily mean they have the same full long run pattern of variability. In particular, relationships between random variables will affect distributions of transformations of them. \(U_1\) and \(U_2\) have the same marginal distribution, but the joint distribution of \((U_1, U_2)\) is not the same as that of \((U_1, U_1)\), and so the distribution of \(U_1+U_2\) is not the same as that of \(U_1+U_1\).

In general the order of transforming and averaging is not interchangeable. However, the order is interchangeable for linear transformations.

Theorem 3.1 (Linearity of averages) If \(X\) and \(Y\) are random variables and \(m\) and \(b\) are non-random constants, whether in the short run or the long run,

\[\begin{align*} \text{Average of $(mX+b)$} & = m(\text{Average of $X$})+b\\ \text{Average of $(X+Y)$} & = \text{Average of $X$} +\text{Average of $Y$} \end{align*}\]

The following restates Theorem 3.1 in terms of long run averages (expected values).

Theorem 3.2 (Linearity of expected value) If \(X\) and \(Y\) are random variables and \(m\) and \(b\) are non-random constants,

\[\begin{align*} \textrm{E}(mX+b) & = m\textrm{E}(X)+b\\ \textrm{E}(X+Y) & = \textrm{E}(X) + \textrm{E}(Y) \end{align*}\]

For example, if \(X\) measures temperature in degrees Celsius with average 24°C, then \(1.8X+32\) measures temperature in degrees Fahrenheit with average 75.2°F.

Averaging involves adding and dividing. Linear transformations involve only adding/subtracting and multiplying/dividing. The ability to interchange the order of averaging and linear transformations follows simply from basic properties of arithmetic (commutative, associative, distributive).

Note that the average of the sum of \(X\) and \(Y\) is the sum of the average of \(X\) and the average of \(Y\) regardless of the relationship between \(X\) and \(Y\). We will explore this idea in more detail later.
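Linearity of averages can be checked directly on the simulated rolls from Table 3.10 (a sketch; the temperature-style transformation reuses the Fahrenheit example above):

```python
# First and second rolls from Table 3.10
u1 = [2, 1, 3, 4, 3, 3, 2, 2, 1, 3]
u2 = [1, 1, 3, 3, 2, 4, 3, 4, 2, 4]

def avg(values):
    return sum(values) / len(values)

# Average of the sum equals the sum of the averages...
print(avg([a + b for a, b in zip(u1, u2)]))  # 5.1
print(avg(u1) + avg(u2))                     # 2.4 + 2.7 = 5.1

# ...and averaging commutes with a linear transformation m * x + b
m, b = 1.8, 32
print(avg([m * a + b for a in u1]))  # equals m * avg(u1) + b
print(m * avg(u1) + b)
```

Both identities hold exactly, whether the average is over 10 simulated values or 10 million.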

3.7.5 Averages of indicator random variables

Recall that indicators are the bridge between events and random variables. Indicators are also the bridge between relative frequencies and averages.

Example 3.14 Recall your tactile simulation from Example 3.5. Let \(A\) be the event that the first roll is 3 and \(\textrm{I}_A\) the corresponding indicator random variable. Based only on the results of your simulation, approximate the long run average value of \(\textrm{I}_A\). What do you notice?

Solution 3.14. Our simulation results are in Table 3.10. Approximate the long run average value of \(\textrm{I}_A\) by summing the 10 simulated values of \(\textrm{I}_A\) and dividing by 10. \[ \frac{0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 + 0 + 1}{10} = \frac{4}{10} \] The average of \(\textrm{I}_A\) is the relative frequency of event \(A\)! When we sum the 1/0 values of \(\textrm{I}_A\) we count the repetitions on which \(A\) occurs. That is, the numerator in the average calculation for \(\textrm{I}_A\) is the frequency of event \(A\), and dividing by the number of repetitions yields the relative frequency of event \(A\).

If \(\textrm{I}_A\) is the indicator random variable of an event \(A\), whether in the short run or the long run, \[ \text{Average of $\textrm{I}_A$} = \text{Relative frequency of $A$} \] In terms of our long run notation, the above long run result is \[ \textrm{E}(\textrm{I}_A) = \textrm{P}(A) \]

Indicators provide a bridge between events and random variables, and between probability and expected value.
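This bridge is easy to see in code (a sketch using the first rolls from Table 3.10):

```python
# First rolls from Table 3.10; A is the event that the first roll is 3
first_rolls = [2, 1, 3, 4, 3, 3, 2, 2, 1, 3]

# Indicator of A: 1 if A occurs on that repetition, 0 otherwise
indicator = [1 if roll == 3 else 0 for roll in first_rolls]

average_of_indicator = sum(indicator) / len(indicator)
relative_frequency = first_rolls.count(3) / len(first_rolls)

print(average_of_indicator)  # 0.4
print(relative_frequency)    # 0.4 -- the same number
```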

3.7.6 Exercises

3.8 Measuring variability: Variance and standard deviation

The long run average value is just one feature of a distribution. Random variables vary, and the distribution describes the entire pattern of variability. It is also convenient to measure the overall degree of variability in a single number. Some values of a variable are close to its mean and some are far, so to measure variability in a single number we might ask: how far are the values away from the mean on average? Variance and standard deviation are numbers that address this question.

Consider again a random variable \(X\) that follows a Normal(30, 10) distribution; we’ll assume \(X\) is measured in minutes (after noon) as in the meeting problem. Any Normal distribution has two parameters, the mean and the standard deviation; the Normal(30, 10) distribution has mean 30 (minutes) and standard deviation 10 (minutes). In Section 3.7.3 we saw simulation evidence supporting that the mean is 30. Now we’ll investigate the standard deviation, which measures, roughly, the average distance from the mean.

We’ll motivate the calculation of standard deviation using just the 10 simulated values in Table 3.5, reproduced in Table 3.12. For now focus on only the first three columns of Table 3.12. We see how far a value is away from the mean by subtracting the mean, resulting in a “deviation”. Deviations are positive for values that are above the mean and negative for values that are below the mean. For example, the value 21.387 is 11.973 minutes below the mean.

Table 3.12: Calculation of standard deviation based on 10 simulated values
repetition x deviation absolute_deviation squared_deviation
1 21.387 -11.973 11.973 143.345
2 50.719 17.359 17.359 301.347
3 24.090 -9.270 9.270 85.940
4 32.680 -0.680 0.680 0.462
5 54.511 21.151 21.151 447.347
6 38.117 4.757 4.757 22.629
7 16.819 -16.541 16.541 273.619
8 22.980 -10.380 10.380 107.746
9 28.867 -4.493 4.493 20.187
10 43.430 10.070 10.070 101.411
Sum 333.600 0.000 106.674 1504.033
Average 33.360 0.000 10.667 150.403

We might try to compute the average deviation from the mean, but this will always be 0. The mean is the balance point—that is, the center of gravity—so the (positive) deviations above the mean will always balance out, in total, with the (negative) deviations below the mean. However, regarding degree of variability, we only care about how far values are away from the mean, not if they’re above or below. So we take the absolute value of each deviation and then find the average. See the fourth column of Table 3.12, resulting in an average absolute deviation of 10.667 minutes.
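Both facts can be checked with the table’s values (a sketch; the values below are rounded to three decimals, so totals may differ from Table 3.12 in the last digit):

```python
# The 10 simulated arrival times from Table 3.12
x = [21.387, 50.719, 24.090, 32.680, 54.511,
     38.117, 16.819, 22.980, 28.867, 43.430]

mean = sum(x) / len(x)                 # 33.36
deviations = [v - mean for v in x]

# The plain deviations always average to 0 (up to floating point)...
print(sum(deviations) / len(deviations))

# ...so average the absolute deviations instead
abs_devs = [abs(d) for d in deviations]
print(round(sum(abs_devs) / len(abs_devs), 3))  # about 10.667
```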

Table 3.12 motivates the calculation, but we’re really interested in what happens in the long run, so let’s carry out the calculation for many simulated values.

X = RV(Normal(30, 10))

x = X.sim(10000)

x
Index  Result
0      28.068717202953426
1      41.071061638978165
2      28.4750271952485
3      14.379812965123037
4      27.03556958491586
5      42.484188523740336
6      16.655112404664404
7      60.66527172082902
8      45.331750559123904
...    ...
9999   34.478376043825364

The mean is about 30.

x.mean()
29.99318245098022

Now we compute the absolute deviations from the mean.

abs(x - x.mean())
Index  Result
0      1.9244652480267952
1      11.077879187997944
2      1.5181552557317204
3      15.613369485857184
4      2.9576128660643626
5      12.491006072760115
6      13.338070046315817
7      30.672089269848797
8      15.338568108143683
...    ...
9999   4.4851935928451425

Then we average the absolute deviations.

abs(x - x.mean()).mean()
8.027786452205826

Unfortunately, the above calculation yields roughly 8 rather than the standard deviation of 10. It turns out that for a Normal(30, 10) distribution the long run average absolute deviation from the mean is about 8. We can conceptualize standard deviation as average distance from the mean, but the actual calculation of standard deviation is a little more complicated. Technically, we must first square all the distances and then average; the result is the variance. Then the square root of the variance is the standard deviation.

\[\begin{align*} \text{Variance of } X & = \text{Average of } [(X - \text{Average of } X)^2]\\ \text{Standard deviation of } X & = \sqrt{\text{Variance of } X} \end{align*}\]
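The “about 8” observed above is no accident: for a Normal distribution, the long run average absolute deviation from the mean is \(\sigma\sqrt{2/\pi} \approx 0.798\sigma\), which is about 7.98 when \(\sigma = 10\). As a rough check, here is a sketch in plain Python (standard library only, independent of the Symbulate code used elsewhere):

```python
import math
import random

random.seed(1)

# Simulate many Normal(30, 10) values (a plain-Python stand-in for X.sim)
xs = [random.gauss(30, 10) for _ in range(100_000)]
m = sum(xs) / len(xs)
mad = sum(abs(x - m) for x in xs) / len(xs)

# Theory: for Normal(mu, sigma), the long run mean absolute deviation is
# sigma * sqrt(2 / pi), about 0.798 * sigma -- not sigma itself
theoretical = 10 * math.sqrt(2 / math.pi)
```

With 100,000 simulated values, `mad` lands close to the theoretical value of about 7.98, not the standard deviation of 10.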

Returning to the 10 simulated values in Table 3.12, the fifth column contains the squared deviations. Notice that squaring the deviations has a similar effect to taking the absolute value: a value that is 3 units above the mean has the same squared deviation as a value that is 3 units below the mean (since \(3^2 = (-3)^2\)). Now we average the squared deviations to obtain the variance. The average is computed in the usual way: sum all the values and divide by the number of values. But now each value included in the average is the squared deviation of \(X\) from the mean, rather than the value of \(X\) itself. The variance of the 10 simulated values is 150.403. This probably seems like a weird number, and it is: because we squared all the deviations before averaging, the measurement units of the variance are minutes\(^2\). To get back to the original measurement units, we take the square root of the variance, \(\sqrt{150.403}\), resulting in a standard deviation of 12.264 minutes.
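As a concrete check of Table 3.12, the following plain-Python sketch (standard library only) reproduces the summary numbers from the 10 simulated values. Tiny differences in the last decimal places arise because the table was computed from unrounded simulated values.

```python
# The 10 simulated values from Table 3.12
values = [21.387, 50.719, 24.090, 32.680, 54.511,
          38.117, 16.819, 22.980, 28.867, 43.430]
n = len(values)
mean = sum(values) / n                          # 33.360
deviations = [v - mean for v in values]         # sum to 0 (the balance point)
mad = sum(abs(d) for d in deviations) / n       # 10.667 (average absolute deviation)
variance = sum(d ** 2 for d in deviations) / n  # about 150.40 (minutes^2)
sd = variance ** 0.5                            # about 12.264 minutes
```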

Now back to the long run, using 10000 simulated values x. The following code shows the “long way” of computing variance and standard deviation. First, find the squared distance between each simulated value and the mean.

(x - x.mean()) ** 2 
Index  Result
0      3.7035664908628343
1      122.71940730387799
2      2.3047953805058454
3      243.77730670189624
4      8.747473865509454
5      156.02523270973006
6      177.90411256042722
7      940.7770601775737
8      235.27167160816248
...    ...
9999   20.116961565299118

Then compute the average squared deviation to get the variance.

((x - x.mean()) ** 2).mean()
100.33177583839742

Now take the square root of the variance to get the standard deviation.

sqrt(((x - x.mean()) ** 2).mean())
10.016575055296967

We see that the standard deviation is about 10—as it should be for values simulated from a Normal(30, 10) distribution—and the variance is about \(10^2 = 100\). Fortunately, var and sd will carry out these calculations more quickly.

x.var()
100.33177583839742
x.sd()
10.016575055296967

Example 3.15 We’ll compare long run average and standard deviation for the Uniform(0, 60) distribution and the Normal(30, 10) distribution.

  1. Make an educated guess for the long run average value of a Uniform(0, 60) distribution.
  2. Will the standard deviation for a Uniform(0, 60) distribution be greater than, less than, or equal to 10, the standard deviation for a Normal(30, 10) distribution? Explain without doing any calculations. (Hint: It might help to compare the spinners in Section 3.2 or the simulations from Section 3.4.)
  3. Make an educated guess for the standard deviation of a Uniform(0, 60) distribution.

Solution 3.15.

  1. It seems reasonable that the long run average value of a Uniform(0, 60) distribution is 30, the balance point of the distribution.
  2. While the Uniform(0, 60) and Normal(30, 10) distributions have the same mean of 30, the Uniform(0, 60) distribution has a larger standard deviation. In comparison to a Normal(30, 10) distribution, a Uniform(0, 60) distribution gives higher probability to ranges of values near the extremes of 0 and 60, and lower probability to ranges of values near 30. Thus, there will be more values far from the mean of 30 and fewer values close to it, and so the average distance from the mean, and hence the standard deviation, will be larger for the Uniform(0, 60) distribution than for the Normal(30, 10) distribution. See Figure 3.22, which compares histograms of simulated values from the two distributions.
  3. In a Uniform(0, 60) distribution, values are “evenly spread” from 0 to 60, so distances from the mean are “evenly spread” from 0 (for 30) to 30 (for 0 and 60). We might expect the standard deviation—roughly the average distance from the mean—to be about 15, halfway between 0 and 30. It turns out that the standard deviation is about 17; 15 is the average absolute deviation from the mean. While the “average distance” interpretation helps our conceptual understanding of standard deviation, the process of squaring the distances, then averaging, and then taking the square root makes guessing the actual value of standard deviation difficult.
U = RV(Uniform(0, 60))

u = U.sim(10000)

plt.figure()

x.plot()
u.plot()

plt.legend(['Normal(30, 10)', 'Uniform(0, 60)']);
plt.show()
Figure 3.22: Histograms of many simulated values from each of the Normal(30, 10) and Uniform(0, 60) distributions.
u.sd()
17.31123266800004
abs(u - u.mean()).mean()
15.022158128174167
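For reference, the simulated values above have exact counterparts: for a Uniform(0, 60) distribution the standard deviation is \(60/\sqrt{12} \approx 17.32\) and the average absolute deviation is \(60/4 = 15\). A plain-Python sketch (standard library only, not the Symbulate code above) confirms this by simulation:

```python
import math
import random

random.seed(2)

# Simulate many Uniform(0, 60) values (a plain-Python stand-in for U.sim)
us = [random.uniform(0, 60) for _ in range(100_000)]
m = sum(us) / len(us)
sd = math.sqrt(sum((u - m) ** 2 for u in us) / len(us))
mad = sum(abs(u - m) for u in us) / len(us)

# Exact values for Uniform(0, 60):
#   standard deviation      = 60 / sqrt(12), about 17.32
#   mean absolute deviation = 60 / 4 = 15
```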

Example 3.16 The plots below summarize hypothetical distributions of quiz scores in six classes. Each quiz score is a whole number between 0 and 10 inclusive. All plots are on the same scale with quiz scores on the horizontal axis, and probability on the vertical axis.

  1. Donny Don’t says that of these six plots, C represents the smallest SD, since there is “no variability in the heights of the bars”. Do you agree that C represents “no variability”? Explain.
  2. What is the smallest possible value the SD of quiz scores could be? What would need to be true about the distribution for this to happen? (This scenario might not be represented by one of these six plots.)
  3. Without doing any calculations, identify which of these six classes exhibits the smallest standard deviation of quiz scores.
  4. Without doing any calculations, arrange the classes in order based on their standard deviations from smallest to largest.
  5. In one of the classes, the standard deviation of quiz scores is 5. Which one? Why?
  6. Is the standard deviation in F greater than, less than, or equal to 1? Why?
  7. Provide a ballpark estimate of standard deviation in each case.
Figure 3.23
Figure 3.24
Figure 3.25
Figure 3.26
Figure 3.27
Figure 3.28

Solution 3.16.

  1. We disagree with Donny. Standard deviation measures variability of the values of the variable, not their probabilities. If we were to simulate values according to the distribution in C, we would observe some 0s, some 1s, some 2s, all the way through some 10s (with roughly equal frequency). So there would certainly be variability in the values of the variable. Remember that values of the variable are along the horizontal axis, so standard deviation measures average distance from the mean horizontally.
  2. The smallest possible value of standard deviation is 0, which occurs only if the random variable is a constant (with probability 1). In this context, if every student had the same quiz score, e.g., if 100% of students scored an 8, then the standard deviation would be 0. The distribution plot would have a single spike at a single value.
  3. Remember that values of the variable are along the horizontal axis, so standard deviation measures average distance from the mean horizontally. The distribution in F represents the smallest standard deviation. The mean is 7, most of the scores are 7, and some of the scores are only 1 unit away from the mean. In all other situations, the probability that the random variable is equal to its mean is smaller, and the probability that the random variable takes a value more than 1 unit away from its mean is larger than in F.
  4. In all plots other than F the mean is 5, so the smallest standard deviation occurs where the values tend to be close to 5, and the largest occurs where the values tend to be far from 5. In order from smallest to largest standard deviation: F, E, A, C, B, D.
  5. In D, the score takes values 0 and 10 with probability 0.5. The mean is 5, and all deviations are 5 units away from the mean, so the average squared deviation will be \(5^2\) and the standard deviation will be 5.
  6. Less than 1. Don’t forget that there is a high probability of a deviation of 0 in this case. So the average deviation will be somewhere between 0 and 1.
  7. Some of these are harder than others. In E, many values are 0 units away from the mean, many values are 1 unit away, and some values are 2 units away. So the standard deviation in E is maybe around 1. In C, there are values that are 0 units away, and about as many values that are each 1, 2, 3, 4, and 5 units away; we might expect standard deviation to be around 2.5. But remember, while considering the average absolute deviation helps intuition, standard deviation is actually the square root of the average squared deviation, so the exact value can be difficult to ascertain without computation. The actual values are: F = 0.45, E = 1.15, A = 2.4, C = 3.1, B = 3.7, D = 5.

Variance and standard deviation are both based on average squared deviations from the mean. Variance has some nice mathematical properties (which we’ll see later), but standard deviation is the more “practical” number since it has the same measurement units as the variable. In computations we often work with variances then take the square root at the end to report and interpret standard deviations.

Standard deviation is the most commonly used single-number measure of variability, but it’s not the only one. Average absolute deviation also measures variability; why not use that? It turns out that squaring deviations, rather than taking absolute values, leads to nicer mathematical properties. Also, standard deviation is the natural measure of variability for Normal distributions (we’ll investigate why later). While standard deviation is commonly used, there are other, sometimes better, ways to summarize variability. For example, we will see later how percentiles can be used to summarize the pattern of variability.

The following definition states our notion of variance in expected value terms.

Definition 3.1 (Variance and standard deviation of a random variable) The variance of a random variable \(X\) is \[ \textrm{Var}(X) = \textrm{E}((X-\textrm{E}(X))^2) \] The standard deviation of a random variable \(X\) is \(\textrm{SD}(X) = \sqrt{\textrm{Var}(X)}\).

Recall that an expected value is a long run average. Therefore we can interpret \(\textrm{E}(\cdot)\) as “simulate many values of what’s inside \((\cdot)\) and average”. In this way \[ \textrm{E}((X-\textrm{E}(X))^2) = \text{Long Run Average of } [(X - \text{Long Run Average of } X)^2] \]

Recall that for discrete random variables we can compute expected values as probability-weighted average values. We can compute variances for discrete random variables in a similar way.

Example 3.17 In the dice rolling problem, let \(Y\) be the larger of the two rolls. The marginal distribution of \(Y\) is represented by Table 2.15. In Example 2.44 we computed \(\textrm{E}(Y) = 3.125\).

  1. Compute \(\textrm{Var}(Y)\).
  2. Compute \(\textrm{SD}(Y)\).
  3. Describe in detail how \(\textrm{Var}(Y)\) can be approximated via simulation.
  4. Write Symbulate code to approximate \(\textrm{Var}(Y)\) and \(\textrm{SD}(Y)\) and compare to the values from parts 1 and 2.

Solution 3.17.

  1. \(Y\) takes values 1, 2, 3, 4, and \(\textrm{E}(Y)=3.125\), so \((Y-\textrm{E}(Y))^2\) takes values \((1-3.125)^2\), \((2-3.125)^2\), \((3-3.125)^2\), \((4-3.125)^2\), with respective probabilities 1/16, 3/16, 5/16, 7/16. To compute \(\textrm{E}((Y-\textrm{E}(Y))^2)\) as a probability-weighted average value, multiply each possible value of \((Y-\textrm{E}(Y))^2\) by its probability and sum. \[ (1-3.125)^2\times 0.0625 + (2-3.125)^2 \times 0.1875 + (3-3.125)^2 \times 0.3125 + (4-3.125)^2 \times 0.4375 = 0.8594 \] See Table 3.13 for more detail. \(\textrm{Var}(Y) = 0.8594\).
  2. \(\textrm{SD}(Y)=\sqrt{\textrm{Var}(Y)} = \sqrt{0.8594} = 0.9270\).
  3. We have already discussed how to simulate a value of \(Y\) in this context. To approximate the variance:
    • Simulate many values of \(Y\)
    • Compute the average of the simulated values of \(Y\) (to approximate \(\textrm{E}(Y)\))
    • From each simulated value of \(Y\) subtract the average, and then square
    • Compute the average of the squared deviations from the previous part to approximate the variance
  4. See below for the code, both the long way and using the var and sd shortcuts. The simulation-based approximations are close to the values from parts 1 and 2.
Table 3.13: Steps in computing the variance of \(Y\) in Example 3.17.
\(y\) \(y - \textrm{E}(Y)\) \((y - \textrm{E}(Y))^2\) \(\textrm{P}(Y=y)\) \((y - \textrm{E}(Y))^2\textrm{P}(Y=y)\)
1 -2.125 4.5156 0.0625 0.2822
2 -1.125 1.2656 0.1875 0.2373
3 -0.125 0.0156 0.3125 0.0049
4 0.875 0.7656 0.4375 0.3350
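The probability-weighted average calculation in Table 3.13 can be checked with a few lines of plain Python (standard library only):

```python
# Probability-weighted average computation of Var(Y), as in Table 3.13
ys = [1, 2, 3, 4]
ps = [1 / 16, 3 / 16, 5 / 16, 7 / 16]
ey = sum(y * p for y, p in zip(ys, ps))                 # E(Y) = 3.125
var_y = sum((y - ey) ** 2 * p for y, p in zip(ys, ps))  # 0.859375
sd_y = var_y ** 0.5                                     # about 0.9270
```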
P = BoxModel([1, 2, 3, 4], size = 2)

Y = RV(P, max)

y = Y.sim(10000)

y
Index  Result
0      3
1      3
2      2
3      4
4      2
5      4
6      4
7      3
8      4
...    ...
9999   4
y.mean()
3.1294
y - y.mean()
Index  Result
0      -0.12939999999999996
1      -0.12939999999999996
2      -1.1294
3      0.8706
4      -1.1294
5      0.8706
6      0.8706
7      -0.12939999999999996
8      0.8706
...    ...
9999   0.8706
(y - y.mean()) ** 2
Index  Result
0      0.01674435999999999
1      0.01674435999999999
2      1.2755443599999998
3      0.7579443600000001
4      1.2755443599999998
5      0.7579443600000001
6      0.7579443600000001
7      0.01674435999999999
8      0.7579443600000001
...    ...
9999   0.7579443600000001
((y - y.mean()) ** 2).mean()
0.86265564
y.var()
0.86265564
y.sd()
0.9287925710297213

Example 3.18 Continuing Example 3.17.

  1. Compute \(\textrm{E}(Y^2)\).
  2. Describe in detail how \(\textrm{E}(Y^2)\) can be approximated via simulation.
  3. Compute \(\textrm{E}(Y^2) - (\textrm{E}(Y))^2\). What do you notice?

Solution 3.18.

  1. \(Y\) takes values 1, 2, 3, 4, so \(Y^2\) takes values \(1^2, 2^2, 3^2, 4^2\), with respective probabilities 1/16, 3/16, 5/16, 7/16. To compute \(\textrm{E}(Y^2)\) as a probability-weighted average value, multiply each possible value of \(Y^2\) by its probability and sum. \[ 1^2\times 0.0625 + 2^2 \times 0.1875 + 3^2 \times 0.3125 + 4^2 \times 0.4375 = 10.625 \] \(\textrm{E}(Y^2) = 10.625\). Notice this is not the same as \((\textrm{E}(Y))^2=3.125^2.\)
  2. We have already discussed how to simulate a value of \(Y\) in this context. To approximate \(\textrm{E}(Y^2)\):
    • Simulate many values of \(Y\)
    • Square each simulated value of \(Y\)
    • Compute the average of the squared values from the previous part to approximate \(\textrm{E}(Y^2)\). See the Symbulate code below.
  3. \(\textrm{E}(Y^2) - (\textrm{E}(Y))^2 = 10.625 - 3.125^2 = 0.8594\), which is \(\textrm{Var}(Y)\).
(y ** 2)
Index  Result
0      9
1      9
2      4
3      16
4      4
5      16
6      16
7      9
8      16
...    ...
9999   16
(y ** 2).mean()
10.6558
(y ** 2).mean() - (y.mean()) ** 2
0.8626556399999998

Example 3.18 illustrates an alternative formula for computing variance. Definition 3.1 represents the concept of variance. However, variance is usually computed using the following equivalent but slightly simpler formula.

Lemma 3.1 \[ \textrm{Var}(X) = \textrm{E}(X^2) - (\textrm{E}(X))^2 \]

That is, the variance of \(X\) is the expected value of the square of \(X\) minus the square of the expected value of \(X\).
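A quick plain-Python check of Lemma 3.1 for the random variable \(Y\) from Example 3.17:

```python
# Check Lemma 3.1 for Y: Var(Y) = E(Y^2) - (E(Y))^2
ys = [1, 2, 3, 4]
ps = [1 / 16, 3 / 16, 5 / 16, 7 / 16]
ey = sum(y * p for y, p in zip(ys, ps))                      # E(Y)   = 3.125
ey2 = sum(y ** 2 * p for y, p in zip(ys, ps))                # E(Y^2) = 10.625
var_direct = sum((y - ey) ** 2 * p for y, p in zip(ys, ps))  # definition
var_shortcut = ey2 - ey ** 2                                 # 0.859375, same value
```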

We will see that variance has many nice theoretical properties. Whenever you need to compute a standard deviation, first find the variance and then take the square root at the end.

We will see how to compute variance for continuous random variables later. But Definition 3.1 and Lemma 3.1 apply for both discrete and continuous random variables. Furthermore, variance can be approximated via simulation in the same way for discrete or continuous random variables: simulate many values of the random variable and compute the average of the squared deviations from the mean.

3.8.1 Standardization

Standard deviation provides a “ruler” by which we can judge a particular value of a random variable relative to the distribution of values. This idea is particularly useful when comparing random variables with different measurement units but whose distributions have similar shapes.

Example 3.19 SAT scores have, approximately, a Normal distribution with a mean of 1050 and a standard deviation of 200. ACT scores have, approximately, a Normal distribution with a mean of 21 and a standard deviation of 5.5. Darius’s score on the SAT is 1500. Alfred’s score on the ACT is 31. We’ll investigate who scored relatively better on their test.

  1. Compute the deviation from the mean for Darius’s SAT score. How large is this value relative to the standard deviation for SAT scores (which measures, roughly, the average deviation from the mean)?
  2. Compute the deviation from the mean for Alfred’s ACT score. How large is this value relative to the standard deviation for ACT scores?
  3. Who scored relatively better on their test? Explain.

Solution 3.19.

  1. Darius’s score is \(1500-1050 = 450\) points above the mean SAT score. The standard deviation of SAT scores is 200 points. Roughly, the deviation of Darius’s score from the mean is \(450/200 = 2.25\) times larger than the average deviation. That is, Darius’s SAT score of 1500 is 2.25 standard deviations above the mean SAT score.
  2. Alfred’s score is \(31-21 = 10\) points above the mean ACT score. The standard deviation of ACT scores is 5.5 points. Roughly, the deviation of Alfred’s score from the mean is \(10/5.5 = 1.82\) times larger than the average deviation. That is, Alfred’s ACT score of 31 is 1.82 standard deviations above the mean ACT score.
  3. Both scores are above average, but relative to other test scores, Darius’s is farther above average than Alfred’s. Both distributions are Normal, so the probability that an SAT score is greater than Darius’s is smaller than the probability that an ACT score is greater than Alfred’s. That is, Darius scored relatively better. See Figure 3.29.
Figure 3.29: Comparison of the Normal distributions in Example 3.19.

Standard deviation provides a “ruler” by which we can judge a particular value of a random variable relative to the distribution of values. Consider the plot for SAT scores in Figure 3.29. For now, pay attention to the two horizontal axis scales on each plot; we’ll discuss the Normal “bell shape” in more detail later. There are two scales on the variable axis: one representing the actual measurement units, and one representing “standardized units”. In the standardized scale, values are measured in terms of standard deviations away from the mean:

  • The mean corresponds to a value of 0.
  • A one unit increment on the standardized scale corresponds to an increment equal to the standard deviation in the measurement unit scale.

For example, each one unit increment in the standardized scale corresponds to a 200 point increment in the measurement unit scale for SAT scores, and a 5.5 point increment in the measurement unit scale for ACT scores. An SAT score of 1250 is “1 standard deviation above the mean”; an ACT score of 10 is “2 standard deviations below the mean”. Given a specific distribution, the more standard deviations a particular value is away from its mean, the more extreme or “unusual” it is.

Given a value of a variable, its standardized value marks where the value lies on the standardized scale (e.g., 1500 on the SAT scale corresponds to 2.25 on the standardized scale).

\[ \text{Standardized value} = \frac{\text{Value - Mean}}{\text{Standard deviation}} \]
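A minimal sketch of this formula in Python, applied to the scores from Example 3.19 (the function name standardize here is just an illustrative choice):

```python
def standardize(value, mean, sd):
    """Standardized value: how many standard deviations a value is from the mean."""
    return (value - mean) / sd

# Scores from Example 3.19
z_sat = standardize(1500, 1050, 200)  # Darius: 2.25 SDs above the mean
z_act = standardize(31, 21, 5.5)      # Alfred: about 1.82 SDs above the mean
```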

Warning

Standardization is useful when comparing random variables with different measurement units but whose distributions have similar shapes (like the two Normal distributions in Figure 3.29). However, standardized values are only based on two features of a distribution—mean and standard deviation—rather than the complete pattern of variability. Distributions with different shapes have different patterns of variability. We will see later that when comparing distributions with different shapes, it is better to compare percentiles rather than standardized values to determine what is “extreme” or “unusual”.

Any random variable can be standardized, but keep in mind that just how extreme any particular standardized value is depends on the shape of the distribution. We will see that standardization is most natural for random variables that follow a Normal distribution.

3.8.2 Exercises

Exercise 3.17  

3.9 Measuring relationships: Covariance and correlation

Quantities like long run average, variance, and standard deviation summarize characteristics of the marginal distribution of a single random variable. When there are multiple random variables, their joint distribution is of interest. Covariance and correlation each summarize in a single number a characteristic of the joint distribution of two random variables, namely, the degree to which they “co-deviate from their respective means”.

We’ll illustrate the computations using the Bivariate Normal model of arrival times in the meeting problem from Section 3.4.3. Covariance is the average of the products of the paired deviations from the respective means. Let \(T\) and \(W\) be the first arrival time and waiting time random variables from the meeting problem in the previous section. Recall that we have already simulated some \((T, W)\) pairs.

t_and_w = meeting_sim[2, 3]

The simulated long run averages are (notice that calling mean returns the pair of averages)

t_and_w.mean()
(26.527051604036775, 6.520210743884388)

Now, within each simulated \((t, w)\) pair, we subtract the mean of the \(T\) values from \(t\) and the mean of the \(W\) values from \(w\). We can do this subtraction in a “vectorized” way: the following code subtracts the first component of t_and_w.mean() (the average of the \(T\) values) from the first component of each of the t_and_w values (the individual \(t\) value), and similarly for the second.

t_and_w - t_and_w.mean()
Index  Result
0      (-4.628631767908644, -2.5848468141401826)
1      (1.976238715319873, -1.752856052191519)
2      (4.766336957042803, 0.7305984634351921)
3      (-1.7796912967906628, -4.72551667906906)
4      (10.218936091990376, 2.8156784514425466)
5      (6.987740371075663, -3.909957285397799)
6      (-9.54451825493543, -3.811178915398746)
7      (-19.282404455410465, -0.642255287327651)
8      (1.3960049506982237, -3.3202450077117485)
...    ...
999    (1.1116540015586658, 0.3328796276468067)

For each simulated pair, we now have a pair of deviations from the respective means. To measure the interaction between these deviations, we take the product of the deviations within each pair. First, we split the paired deviations into two separate columns and then take the product within each row.

t_deviations = (t_and_w - t_and_w.mean())[0] # first column (Python 0)
w_deviations = (t_and_w - t_and_w.mean())[1] # second column (Python 1)

t_deviations * w_deviations
Index  Result
0      11.964304079106698
1      -3.464061992723632
2      3.482278457029841
3      8.409960906578323
4      28.77323815088581
5      -27.321766372355608
6      36.375866730848344
7      12.384226213877627
8      -4.635078468296663
...    ...
999    0.3700469701109313

Now we take the average of the products of the paired deviations.

(t_deviations * w_deviations).mean()
-8.648446933123367

This value is the covariance, which we could have arrived at more quickly by calling cov on the simulated pairs.

t_and_w.cov()
np.float64(-8.64844693312337)

The covariance of random variables \(X\) and \(Y\) is defined as the long run average of the product of the paired deviations from the respective means

\[ \text{Covariance($X$, $Y$)} = \text{Average of } ((X - \text{Average of }X)(Y - \text{Average of }Y)) \] It turns out that covariance is equivalent to the average of the product minus the product of the averages.

\[ \text{Covariance($X$, $Y$)} = (\text{Average of } XY) - (\text{Average of }X)(\text{Average of }Y) \]

(t_and_w[0] * t_and_w[1]).mean() - (t_and_w[0].mean()) * (t_and_w[1].mean()) 
-8.648446933123324
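The equivalence of the two covariance formulas can also be checked in plain Python on a small set of hypothetical \((x, y)\) pairs (the data below are made up purely for illustration):

```python
# Check that the two covariance formulas agree, on small hypothetical data
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 6.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# average of the products of the paired deviations
cov_dev = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
# average of the product minus the product of the averages
cov_shortcut = sum(x * y for x, y in zip(xs, ys)) / n - mx * my
```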

Unfortunately, the value of covariance is hard to interpret as it depends heavily on the measurement unit scale of the random variables. In the meeting problem, covariance is measured in squared-minutes. If we converted the \(T\) values from minutes to seconds, then the value of covariance would change accordingly and would be measured in (minutes \(\times\) seconds).

But the sign of the covariance does have practical meaning.

  • A positive covariance indicates an overall positive association: above average values of \(X\) tend to be associated with above average values of \(Y\).
  • A negative covariance indicates an overall negative association: above average values of \(X\) tend to be associated with below average values of \(Y\).
  • A covariance of zero indicates that the random variables are uncorrelated: there is no overall positive or negative association. But be careful: if \(X\) and \(Y\) are uncorrelated there can still be a relationship between \(X\) and \(Y\). We will see examples later that demonstrate that being uncorrelated does not necessarily imply that random variables are independent.

Example 3.20 Consider the probability space corresponding to two rolls of a fair four-sided die. Let \(X\) be the sum of the two rolls, \(Y\) the larger of the two rolls, \(W\) the number of rolls equal to 4, and \(Z\) the number of rolls equal to 1. Without doing any calculations, determine if the covariance between each of the following pairs of variables is positive, negative, or zero. Explain your reasoning conceptually.

  1. \(X\) and \(Y\)
  2. \(X\) and \(W\)
  3. \(X\) and \(Z\)
  4. \(X\) and \(V\), where \(V = W + Z\).
  5. \(W\) and \(Z\)

Solution 3.20.

  1. Positive. There is an overall positive association; above average values of \(X\) tend to be associated with above average values of \(Y\) (e.g., (7, 4), (8, 4)), and below average values of \(X\) tend to be associated with below average values of \(Y\) (e.g., (2, 1), (3, 2)).
  2. Positive. If \(W\) is large (roll many 4s) then \(X\) (sum) tends to be large.
  3. Negative. If \(Z\) is large (roll many 1s) then \(X\) (sum) tends to be small.
  4. Zero. Basically, the positive association between \(X\) and \(W\) cancels out with the negative association between \(X\) and \(Z\). \(V\) is large when there are many 1s or many 4s, or some mixture of 1s and 4s. So knowing that \(V\) is large doesn’t really tell you anything about the sum.
  5. Negative. There is a fixed number of rolls. If you roll lots of 4s (\(W\) is large) then there must be few rolls of 1s (\(Z\) is small).

The numerical value of the covariance depends on the measurement units of both variables, so interpreting it can be difficult. Covariance is a measure of joint association between two random variables that has many nice theoretical properties, but the correlation (coefficient) is often a more practical measure. (We saw a similar idea with variance and standard deviation. Variance has many nice theoretical properties. However, standard deviation is often a better practical measure of variability.)

The correlation for two random variables is the covariance between the corresponding standardized random variables. Therefore, correlation is a standardized measure of the association between two random variables.

Returning to the \((T, W)\) example, we carry out the same process, but first standardize each of the simulated values of \(T\) by subtracting their mean and dividing by their standard deviation, and similarly for \(W\).

t_and_w.standardize()
Index  Result
0      (-0.5047505374249027, -0.5251879996962353)
1      (0.21550808179504619, -0.35614449520565616)
2      (0.5197672360319449, 0.14844266340800752)
3      (-0.1940746604069108, -0.9601283289342668)
4      (1.1143711020834617, 0.5720882667442697)
5      (0.7620104351657607, -0.794423342374682)
6      (-1.0408260929378685, -0.7743535980984382)
7      (-2.102738886941172, -0.13049313707904978)
8      (0.15223380999934855, -0.6746058700896682)
...    ...
999    (0.1212254469252813, 0.06763433129850954)

Now we compute the covariance of the standardized values.

t_and_w.standardize().cov()
np.float64(-0.19162063134563712)

Recall that we already computed the correlation in the previous section. Standardizing does not change the correlation, so calling corr on the standardized pairs returns the same value.

t_and_w.standardize().corr()
np.float64(-0.19162063134563703)

We see that the correlation is the covariance of the standardized random variables.

\[ \text{Correlation}(X, Y) = \text{Covariance}\left(\frac{X- \text{Average of }X}{\text{Standard Deviation of }X}, \frac{Y- \text{Average of }Y}{\text{Standard Deviation of }Y}\right) \]

When standardizing, subtracting the means merely shifts the center of the joint distribution and does not change the covariance; dividing each variable by its standard deviation divides the covariance by the product of the standard deviations. Therefore, correlation is the covariance divided by the product of the standard deviations.

\[ \text{Correlation}(X, Y) = \frac{\text{Covariance}(X, Y)}{(\text{Standard Deviation of }X)(\text{Standard Deviation of }Y)} \]
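The following plain-Python sketch checks, on small hypothetical data, that dividing the covariance by the product of the standard deviations gives the same number as the covariance of the standardized values:

```python
import math

# Correlation two ways, on small hypothetical data
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 6.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
corr = cov / (sx * sy)

# Equivalently: standardize first, then take the covariance
zx = [(x - mx) / sx for x in xs]
zy = [(y - my) / sy for y in ys]
corr2 = sum(a * b for a, b in zip(zx, zy)) / n
```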

A correlation coefficient has no units and is measured on a universal scale. Regardless of the original measurement units of the random variables \(X\) and \(Y\) \[ -1\le \textrm{Correlation}(X,Y)\le 1 \]

  • \(\textrm{Correlation}(X,Y) = 1\) if and only if \(Y=aX+b\) for some \(a>0\)
  • \(\textrm{Correlation}(X,Y) = -1\) if and only if \(Y=aX+b\) for some \(a<0\)

Therefore, correlation is a standardized measure of the strength of the linear association between two random variables. The closer the correlation is to 1 or \(-1\), the closer the joint distribution of \((X, Y)\) pairs hugs a straight line, with positive or negative slope.

Because correlation is computed between standardized random variables, correlation is not affected by a linear rescaling of either variable. A value that is “one standard deviation above the mean” is “one standard deviation above the mean”, whether the variable is measured in feet or inches or meters.

3.9.1 Joint Normal distributions

Just as Normal distributions are commonly assumed for marginal distributions of individual random variables, joint Normal distributions are often assumed for joint distributions of several random variables. We’ll focus on “Bivariate Normal” distributions, which describe a specific pattern of joint variability for two random variables.

Example 3.21 Donny Don’t has just completed a problem where it was assumed that SAT Math scores follow a Normal(500, 100) distribution. Now a follow-up problem asks Donny how he could simulate a single (Math, Reading) pair of scores. Donny says: “That’s easy; just spin the Normal(500, 100) spinner twice, once for Math and once for Reading.” Do you agree? Explain your reasoning.

Solution 3.21. You should not agree with Donny, for two reasons.

  • It’s possible that the distribution of SAT Math scores follows a different pattern than that of SAT Reading scores. So we might need one spinner to simulate a Math score, and a second spinner to simulate the Reading score. (In reality, SAT Math and Reading scores do follow pretty similar distributions. But it’s possible that they could follow different distributions.)
  • Furthermore, there is probably some relationship between scores. It is plausible that students who do well on one test tend to do well on the other. For example, students who score over 700 on Math are probably more likely to score above than below average on Reading. If we simulate a pair of scores by spinning one spinner for Math and a separate spinner for Reading, then there will be no relationship between the scores because the spins are physically independent.

In the previous problem, what we really need is a spinner that generates a pair of scores simultaneously to reflect their association. This is a little harder to visualize, but we could imagine spinning a “globe” with lines of latitude corresponding to SAT Math score and lines of longitude corresponding to SAT Reading score. But this would not be a typical globe:

  • The lines of latitude would not be equally spaced, since SAT Math scores are not equally likely. (We have seen similar issues for one-dimensional Normal spinners.) Similarly for lines of longitude.
  • The scale of the lines of latitude would not necessarily match the scale of the lines of longitude, since Math and Reading scores could follow different distributions. For example, the equator (average Math) might be 500 while the prime meridian (average Reading) might be 520.
  • The “lines” would be tilted or squiggled to reflect the relationship between the scores. For example, the region corresponding to Math scores near 700 and Reading scores near 700 would be larger than the region corresponding to Math scores near 700 but Reading scores near 200.

So we would like a model that

  • Simulates Math scores that follow a Normal distribution pattern, with some mean and some standard deviation.
  • Simulates Reading scores that follow a Normal distribution pattern, with possibly a different mean and standard deviation.
  • Reflects how strongly the scores are associated.

Such a model is called a “Bivariate Normal” distribution. There are five parameters: the two means, the two standard deviations, and the correlation, which reflects the strength of the association between the two scores.
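The five parameters are exactly what is needed to simulate a pair. One standard construction builds the correlated pair from two independent standard Normal spins: the first coordinate uses one spin alone, and the second mixes the two spins according to the correlation. Here is a plain-Python sketch of a single “spin” (this is not Symbulate code, and the Reading mean of 520 is just an illustrative value, as above).

```python
import math
import random

def bivariate_normal_spin(mean1, sd1, mean2, sd2, corr):
    # Two independent standard Normal spins
    z1 = random.gauss(0, 1)
    z2 = random.gauss(0, 1)
    # The first coordinate uses z1 alone; the second mixes z1 and z2
    # so that the pair has correlation corr.
    x = mean1 + sd1 * z1
    y = mean2 + sd2 * (corr * z1 + math.sqrt(1 - corr ** 2) * z2)
    return x, y

# One simulated (Math, Reading) pair; 520 is an illustrative Reading mean
math_score, reading_score = bivariate_normal_spin(500, 100, 520, 100, 0.7)
```

Each coordinate on its own follows the desired marginal Normal distribution, while sharing the spin z1 induces the association between the two scores.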

We have already encountered a Bivariate Normal model. Recall that in one case of the meeting problem we assumed that Regina and Cady tend to arrive around the same time. One way to model correlated pairs of values is with a Bivariate Normal distribution, as in the following. We assumed a Bivariate Normal distribution for the \((R, Y)\) pairs, with means (30, 30), standard deviations (10, 10), and correlation 0.7. (We could have assumed different means and standard deviations.)

R, Y = RV(BivariateNormal(mean1 = 30, sd1 = 10, mean2 = 30, sd2 = 10, corr = 0.7))

(R & Y).sim(1000).plot()

Now we see that the points tend to follow the \(R = Y\) line, so that Regina and Cady tend to arrive near the same time, similar to ?fig-meeting-probmeasure3.

Recall that in some of the previous examples the shapes of one-dimensional histograms could be approximated with a smooth density curve. Similarly, a two-dimensional histogram can sometimes be approximated with a smooth density surface. As with histograms, the height of the density surface at a particular \((X, Y)\) pair of values can be represented by color intensity. A marginal Normal distribution is a “bell-shaped curve”; a Bivariate Normal distribution is a “mound-shaped” surface (imagine a pile of sand). (Symbulate does not yet have the capability to display densities in a three-dimensional-like plot such as this one.)

Figure 3.30: A Bivariate Normal distribution

3.9.2 Exercises

3.10 Conditioning

We can implement conditioning in simulations by only considering repetitions that satisfy the condition. Conditioning can be thought of as taking a subset of, or “filtering”, the simulation results.

3.10.1 Using simulation to approximate conditional probabilities

Conditional probabilities can be approximated using simulated relative frequencies in a similar way to unconditional probabilities, but the presence of conditioning requires more care.

Example 3.22 Recall Example 2.45. Consider simulating a randomly selected U.S. adult and determining whether or not the person is age 18-29 and whether or not they use Snapchat. Let \(A\) be the event that the selected adult is age 18-29, \(C\) be the event that the selected adult uses Snapchat, and \(\textrm{P}\) correspond to randomly selecting an American adult. Suppose that \(\textrm{P}(A) = 0.20\), \(\textrm{P}(C) = 0.24\), and \(\textrm{P}(A\cap C) = 0.13\).

  1. Donny Don’t says, “we need two spinners to simulate this situation: One spinner with areas of 0.20 and 0.80 for age 18-29 or not, and another spinner with areas of 0.24 and 0.76 to represent uses Snapchat or not. Then spin each spinner once to simulate one repetition.” Do you agree? Explain.
  2. How could you perform one repetition of the simulation using just a single spinner? (Hint: it needs 4 sectors.)
  3. How could you perform a simulation, using the spinner in the previous part, to estimate \(\textrm{P}(C | A)\)?
  4. What determines the order of magnitude of the margin of error for your estimate in the previous part?
  5. What is another method for performing the simulation and estimating \(\textrm{P}(C |A)\) that has a smaller margin of error? What is the disadvantage of this method?

Solution 3.22.

  1. Donny’s simulation assumes that age and Snapchat use are independent, but we know that younger people are more likely to use Snapchat than older people. In general, you cannot simulate pairs of events simply from the marginal probabilities of each.
  2. You can construct a spinner for the possible occurrences of the pairs of events—both occur, \(A\) occurs and \(C\) does not, \(C\) occurs and \(A\) does not, neither occurs—and their joint probabilities. From the three given probabilities we can determine (see the interior cells in the two-way table in the solution to Example 2.45):
    • age 18-29 and uses Snapchat: \(\textrm{P}(A\cap C)= 0.13\)
    • age 18-29 and does not use Snapchat: \(\textrm{P}(A\cap C^c)= 0.07\)
    • not age 18-29, and uses Snapchat: \(\textrm{P}(A^c\cap C)= 0.11\)
    • not age 18-29, and does not use Snapchat: \(\textrm{P}(A^c\cap C^c)= 0.69\)
    See the spinner in Figure 3.31.
  3. The following method fixes the number of total spins, say 10000.
    • Spin the joint spinner from the previous part once to simulate a (age, Snapchat) pair.
    • Repeat a fixed number of times, say 10000.
    • Discard the repetitions on which the person was not age 18-29, that is, the repetitions on which \(A\) did not occur. You would expect to have around 2000 repetitions left.
    • Among the remaining repetitions (on which \(A\) occurred), count the number of repetitions on which \(C\) also occurred. So for the roughly 2000 repetitions for which the person is age 18-29, count the repetitions on which the person also uses Snapchat; you would expect a count of around 1300.
    • Estimate \(\textrm{P}(C|A)\) by dividing the two previous counts to obtain a conditional relative frequency. \[ \textrm{P}(C | A)\approx \frac{\text{Number of repetitions on which both $A$ and $C$ occurred}}{\text{Number of repetitions on which $A$ occurred}} \]
  4. Only those repetitions in which \(A\) occurred are used to estimate \(\textrm{P}(C|A)\). So the order of magnitude of the margin of error is determined by the number of repetitions on which \(A\) occurs. This would be around 2000 repetitions for a margin of error of roughly \(1/\sqrt{2000} = 0.022\), rather than 0.01 based on the original 10000 repetitions.
  5. The previous method simulated a fixed number of repetitions first, and then discarded the ones that did not meet the condition \(A\). We could instead discard repetitions that do not meet the condition as we go, and keep performing repetitions until we get a fixed number, say 10000, that do satisfy the condition \(A\). In this way, the estimate \(\textrm{P}(C |A)\) will be based on the fixed number of repetitions, say 10000, that satisfy event \(A\). The disadvantage is increased computational burden; we will need to simulate and discard many repetitions in order to achieve the desired number that satisfy the condition. Roughly, we would need to simulate about 50000 repetitions in total to obtain 10000 repetitions that satisfy event \(A\). The advantage is a more precise estimate of \(\textrm{P}(C|A)\) than in the previous part.
Figure 3.31: Spinner corresponding to Example 3.22.

The conditional probability of event \(A\) given event \(B\) can be approximated by simulating—according to the assumptions encoded in the probability measure \(\textrm{P}\)—the random phenomenon a large number of times and computing the relative frequency of \(A\) among repetitions on which \(B\) occurs.

\[ {\small \textrm{P}(A|B) \approx \frac{\text{number of repetitions on which $A$ and $B$ occur}}{\text{number of repetitions on which $B$ occurs}}, \quad \text{for a large number of repetitions simulated according to $\textrm{P}$} } \] When using simulation to approximate \(\textrm{P}(A|B)\), we first use \(B\) to determine which repetitions to keep, then we find the relative frequency of \(A\).

Remember that the margin of error when approximating a probability based on a simulated relative frequency depends on the number of independently simulated repetitions used to calculate the relative frequency. When approximating \(\textrm{P}(A|B)\) as above, the margin of error is determined by the number of independently simulated repetitions on which \(B\) occurs.

There are two basic ways to use simulation to approximate a conditional probability \(\textrm{P}(A|B)\).

  • Simulate the random phenomenon for a set number of repetitions (say 10000), discard those repetitions on which \(B\) does not occur, and compute the relative frequency of \(A\) among the remaining repetitions (on which \(B\) does occur).
    • Disadvantage: the margin of error is based on only the number of repetitions used to compute the relative frequency. So if you perform 10000 repetitions but \(B\) occurs only on 1000, then the margin of error for estimating \(\textrm{P}(A|B)\) is roughly on the order of \(1/\sqrt{1000} = 0.032\) (rather than \(1/\sqrt{10000} = 0.01\)). Especially if \(\textrm{P}(B)\) is small, the margin of error could be large resulting in an imprecise estimate of \(\textrm{P}(A|B)\).
    • Advantage: not computationally intensive.
  • Simulate the random phenomenon until obtaining a certain number of repetitions (say 10000) on which \(B\) occurs, discarding those repetitions on which \(B\) does not occur as you go, and compute the relative frequency of \(A\) among the remaining repetitions (on which \(B\) does occur).
    • Advantage: the margin of error will be based on the set number of repetitions on which \(B\) occurs.
    • Disadvantage: requires more time/computer power. Especially if \(\textrm{P}(B)\) is small, it will require a large number of repetitions of the simulation to achieve the desired number of repetitions on which \(B\) occurs. For example, if \(\textrm{P}(B)=0.1\), then it will require roughly 100,000 repetitions in total to obtain 10,000 on which \(B\) occurs.

3.10.2 Conditioning in Symbulate

In Symbulate, filter can be used to extract repetitions that satisfy a condition. First we’ll simulate age and Snapchat use for 10000 hypothetical adults. Each ticket in the BoxModel has an (age, Snapchat) pair of labels, like the spinner in Figure 3.31.

P = BoxModel([('Age 18-29', 'Snapchat user'), ('Age 18-29', 'Not Snapchat user'), ('Not Age 18-29', 'Snapchat user'), ('Not Age 18-29', 'Not Snapchat user')],
             probs = [0.13, 0.07, 0.11, 0.69])

sim_all = P.sim(10000)

sim_all
Index  Result
0      ('Not Age 18-29', 'Not Snapchat user')
1      ('Not Age 18-29', 'Not Snapchat user')
2      ('Not Age 18-29', 'Not Snapchat user')
3      ('Not Age 18-29', 'Not Snapchat user')
4      ('Age 18-29', 'Snapchat user')
5      ('Not Age 18-29', 'Not Snapchat user')
6      ('Age 18-29', 'Snapchat user')
7      ('Not Age 18-29', 'Not Snapchat user')
8      ('Not Age 18-29', 'Not Snapchat user')
...    ...
9999   ('Not Age 18-29', 'Not Snapchat user')
sim_all.tabulate()
Outcome                                   Frequency
('Age 18-29', 'Not Snapchat user')        702
('Age 18-29', 'Snapchat user')            1293
('Not Age 18-29', 'Not Snapchat user')    6871
('Not Age 18-29', 'Snapchat user')        1134
Total                                     10000

Now we’ll apply filter to retain only those age 18-29. The function is_Age18 takes as input an (age, Snapchat) pair and returns True if the age is 18-29 (and False otherwise). Applying filter(is_Age18) will only return results for which is_Age18 returns True.

def is_Age18(Age_Snapchat):
    return Age_Snapchat[0] == 'Age 18-29'
  
sim_given_Age18 = sim_all.filter(is_Age18)

sim_given_Age18
Index  Result
0      ('Age 18-29', 'Snapchat user')
1      ('Age 18-29', 'Snapchat user')
2      ('Age 18-29', 'Not Snapchat user')
3      ('Age 18-29', 'Snapchat user')
4      ('Age 18-29', 'Snapchat user')
5      ('Age 18-29', 'Not Snapchat user')
6      ('Age 18-29', 'Not Snapchat user')
7      ('Age 18-29', 'Snapchat user')
8      ('Age 18-29', 'Not Snapchat user')
...    ...
1994   ('Age 18-29', 'Snapchat user')
sim_given_Age18.tabulate()
Outcome                               Frequency
('Age 18-29', 'Not Snapchat user')    702
('Age 18-29', 'Snapchat user')        1293
Total                                 1995

Conditional relative frequencies are computed based only on repetitions which satisfy the event being conditioned on.

sim_given_Age18.tabulate(normalize = True)
Outcome                               Relative Frequency
('Age 18-29', 'Not Snapchat user')    0.3518796992481203
('Age 18-29', 'Snapchat user')        0.6481203007518797
Total                                 1.0

Based on these results, we would estimate \(\textrm{P}(C |A)\) to be around the true value of 0.65 with a margin of error of roughly \(1/\sqrt{2000} = 0.022\).

In Symbulate, the “given” symbol | applies the second method to simulate a fixed number of repetitions that satisfy the event being conditioned on. Be careful when using | when conditioning on an event with small probability. In particular, be careful when conditioning on the value of a continuous random variable (which we’ll discuss in more detail soon).

Below we use RV syntax to carry out the simulation and conditioning. Technically, a random variable always returns a number, but RV in Symbulate does allow for non-numerical outputs. In most situations we will deal with true random variables, and the code syntax below will be more natural.

The following code simulates (Age, Snapchat) pairs until 10000 adults age 18-29 are obtained.

Age, Snapchat = RV(P)

sim_given_Age18 = ( (Age & Snapchat) | (Age == 'Age 18-29') ).sim(10000)

sim_given_Age18
Index  Result
0      (Age 18-29, Not Snapchat user)
1      (Age 18-29, Snapchat user)
2      (Age 18-29, Not Snapchat user)
3      (Age 18-29, Not Snapchat user)
4      (Age 18-29, Snapchat user)
5      (Age 18-29, Not Snapchat user)
6      (Age 18-29, Not Snapchat user)
7      (Age 18-29, Not Snapchat user)
8      (Age 18-29, Snapchat user)
...    ...
9999   (Age 18-29, Not Snapchat user)
sim_given_Age18.tabulate()
Value                              Frequency
(Age 18-29, Not Snapchat user)     3449
(Age 18-29, Snapchat user)         6551
Total                              10000

Since all 10000 simulated pairs satisfy the event being conditioned on (age 18-29), they are all included in the computation of the conditional relative frequencies.

sim_given_Age18.tabulate(normalize = True)
Value                              Relative Frequency
(Age 18-29, Not Snapchat user)     0.3449
(Age 18-29, Snapchat user)         0.6551
Total                              1.0

Based on these results, we would estimate \(\textrm{P}(C |A)\) to be around the true value of 0.65 with a margin of error of roughly \(1/\sqrt{10000} = 0.01\). We have obtained a more precise estimate, but the simulation takes longer to run.

3.10.3 Using conditional probabilities to simulate

In Example 3.22 we directly simulated pairs of events using joint probabilities, and used the results to approximate conditional probabilities. But in many problems conditional probabilities are provided or can be determined directly.

Example 3.23 Recall Example 2.46. Suppose that

  • 65% of American adults age 18-29 use Snapchat
  • 24% of American adults age 30-49 use Snapchat
  • 12% of American adults age 50-64 use Snapchat
  • 2% of American adults age 65+ use Snapchat

Also suppose that

  • 20% of American adults are age 18-29
  • 33% of American adults are age 30-49
  • 25% of American adults are age 50-64
  • 22% of American adults are age 65+

Consider simulating a randomly selected U.S. adult and determining the person’s age group and whether or not they use Snapchat. How could you perform one repetition of the simulation using spinners based solely on the percentages provided above, without first constructing a two-way table or finding \(\textrm{P}(A_1\cap C)\), etc? (Hint: you’ll need a few spinners, but you might not spin them all in a single repetition.)

Solution 3.23. It’s a two-stage simulation: first spin a spinner to determine age group, then given the result spin an appropriate spinner to determine Snapchat use. There will be 4 spinners for Snapchat use, but only 1 will be spun—along with the age spinner—in any single repetition. See Figure 3.32.

  • “Age” spinner: Areas of 0.20, 0.33, 0.25, 0.22 corresponding to, respectively, 18-29, 30-49, 50-64, 65+. Spin this to determine age group.
  • “Snapchat” spinners—only one of the following will be spun in a single repetition:
    • Snapchat given age 18-29: areas of 0.65 and 0.35 corresponding to, respectively, uses Snapchat and does not use Snapchat. If the result of the “age” spinner is 18-29, spin this spinner to determine whether or not the person uses Snapchat.
    • Snapchat given age 30-49: areas of 0.24 and 0.76 corresponding to, respectively, uses Snapchat and does not use Snapchat. If the result of the “age” spinner is 30-49, spin this spinner to determine whether or not the person uses Snapchat.
    • Snapchat given age 50-64: areas of 0.12 and 0.88 corresponding to, respectively, uses Snapchat and does not use Snapchat. If the result of the “age” spinner is 50-64, spin this spinner to determine whether or not the person uses Snapchat.
    • Snapchat given age 65+: areas of 0.02 and 0.98 corresponding to, respectively, uses Snapchat and does not use Snapchat. If the result of the “age” spinner is 65+, spin this spinner to determine whether or not the person uses Snapchat.
Figure 3.32: Spinners corresponding to Example 3.23. The spinner in the top row is spun first to simulate age group, then the corresponding spinner in the bottom row is spun to determine Snapchat use.

A two-stage simulation like the one above basically implements the multiplication rule, joint = conditional \(\times\) marginal. We first simulate age based on marginal probabilities, then use conditional probabilities to simulate Snapchat use, resulting in (age, Snapchat) pairs with the appropriate joint relationship.

In Symbulate, we can code the scenario in Example 3.23 by defining a custom probability space. An outcome is an (Age, Snapchat) pair. Each of the 5 spinners corresponds to a BoxModel. We define a function that specifies how to simulate one repetition in two stages, using the draw method. Then we use that function to define a custom ProbabilitySpace.

def Age_Snapchat_sim():
    Age = BoxModel(['18-29', '30-49', '50-64', '65+'], probs = [0.20, 0.33, 0.25, 0.22]).draw()
    if Age == '18-29':
        Snapchat = BoxModel(['Use', 'NotUse'], probs = [0.65, 0.35]).draw()
    if Age == '30-49':
        Snapchat = BoxModel(['Use', 'NotUse'], probs = [0.24, 0.76]).draw()
    if Age == '50-64':
        Snapchat = BoxModel(['Use', 'NotUse'], probs = [0.12, 0.88]).draw()
    if Age == '65+':
        Snapchat = BoxModel(['Use', 'NotUse'], probs = [0.02, 0.98]).draw()
    return Age, Snapchat
    
P = ProbabilitySpace(Age_Snapchat_sim)
P.sim(10)
Index  Result
0      ('18-29', 'Use')
1      ('65+', 'NotUse')
2      ('65+', 'NotUse')
3      ('50-64', 'NotUse')
4      ('18-29', 'NotUse')
5      ('18-29', 'Use')
6      ('30-49', 'NotUse')
7      ('50-64', 'NotUse')
8      ('50-64', 'NotUse')
...    ...
9      ('30-49', 'NotUse')

Once we have simulated pairs, we can proceed as in Section 3.10.1. For example, the following approximates conditional probabilities of age group given that the adult uses Snapchat, based on 10000 repetitions which satisfy the condition; compare to part 7 of Example 2.46.

Age, Snapchat = RV(P)

sim_given_Snapchat = ( Age | (Snapchat == 'Use') ).sim(10000)

sim_given_Snapchat.tabulate(normalize = True)
Value   Relative Frequency
18-29   0.5236
30-49   0.3276
50-64   0.1309
65+     0.0179
Total   1.0

3.10.4 Conditioning on values of random variables

If \(X\) is a discrete random variable, we can condition on it equalling the value \(x\) by conditioning on the event \(\{X = x\}\).

Example 3.24 We’ll continue with the dice rolling example: roll two fair four-sided dice and let \(X\) be the sum and \(Y\) the larger of the rolls. For each of the following, describe how to approximate the probability based on 10000 repetitions of a tactile simulation (in principle).

  1. \(\textrm{P}(X = 6 | Y = 4)\).
  2. \(\textrm{P}(Y = 4 | X = 6)\).

Solution 3.24. We have already seen how to simulate \((X, Y)\) pairs.

  1. Simulate an \((X, Y)\) pair. If \(Y=4\) keep the pair, otherwise discard it. Continue until 10000 \((X, Y)\) pairs with \(Y=4\) are obtained. Count the number of pairs for which \(X=6\) and divide by 10000 to approximate \(\textrm{P}(X = 6 | Y = 4)\), with a margin of error of 0.01.
  2. Simulate an \((X, Y)\) pair. If \(X=6\) keep the pair, otherwise discard it. Continue until 10000 \((X, Y)\) pairs with \(X=6\) are obtained. Count the number of pairs for which \(Y=4\) and divide by 10000 to approximate \(\textrm{P}(Y = 4 | X = 6)\), with a margin of error of 0.01.

To approximate a conditional probability of an event involving a random variable \(X\) given the value of a random variable \(Y\), we first need to simulate \((X, Y)\) pairs. The variable being conditioned on (\(Y\) here) determines which pairs to keep, and the other variable determines what to find the relative frequency of.

Here is some code for conditioning on \(\{Y = 4\}\).

P = DiscreteUniform(1, 4) ** 2

X = RV(P, sum)
Y = RV(P, max)

( (X & Y) | (Y == 4) ).sim(10)
Index  Result
0      (7, 4)
1      (8, 4)
2      (8, 4)
3      (6, 4)
4      (7, 4)
5      (6, 4)
6      (5, 4)
7      (6, 4)
8      (6, 4)
...    ...
9      (5, 4)
(X | (Y == 4) ).sim(10000).tabulate()
Value  Frequency
5      2856
6      2838
7      2832
8      1474
Total  10000
(X | (Y == 4) ).sim(10000).tabulate(normalize = True)
Value  Relative Frequency
5      0.2839
6      0.287
7      0.2879
8      0.1412
Total  1.0

Below we simulate values of \(X\) given \(Y=y\) for each of the values \(y = 2, 3, 4\). (We omit conditioning on \(Y=1\) since then \(X = 2\) with probability 1.) Each of the three simulations below approximates a separate conditional distribution of \(X\) values.


x_given_Y_eq4 = (X | (Y == 4) ).sim(10000)

x_given_Y_eq3 = (X | (Y == 3) ).sim(10000)

x_given_Y_eq2 = (X | (Y == 2) ).sim(10000)

We plot the three distributions on a single plot to compare how the distribution of \(X\) changes depending on the given value of \(Y\).

x_given_Y_eq4.plot()
x_given_Y_eq3.plot(jitter = True) # shifts the spikes a little
x_given_Y_eq2.plot(jitter = True) # so they don't overlap

Extra care is required when conditioning on the value of a continuous random variable. We’ll consider the Bivariate Normal model of the meeting problem from Section 3.4.

Example 3.25 Suppose we want to approximate the conditional probability that Cady arrives first given that Regina arrives at 12:40.

  1. Donny Don’t writes the Symbulate code below to approximate this probability. What do you think will happen when Donny runs his code?
  2. Can you think of how to fix Donny’s code? (Hint: when we say “Regina arrives at 12:40”, what do we really mean?)
# Donny's code - don't try to run this!
R, Y = RV(BivariateNormal(mean1 = 30, sd1 = 10, mean2 = 30, sd2 = 10, corr = 0.7))

y_given_Req40 = (Y | (R == 40) ).sim(10000)

y_given_Req40.count_lt(40) / 10000

Solution 3.25.

  1. Donny’s code will run forever! Using .sim(10000) with conditioning via | will continue the simulation until obtaining 10000 repetitions that satisfy the condition. Remember, \(R\) is a continuous random variable, so \(\textrm{P}(R = 40)=\textrm{P}(R = 40.00000000000000000\ldots) = 0\). The simulation will never return a single \((R, Y)\) pair with \(R\) equal to \(40.0000000000000\ldots\), let alone 10000 of them.
  2. “Regina arrives at 12:40” doesn’t really mean \(R\) “is equal to 40” but rather \(R\) “is close enough to 40”, where “close enough” depends on the context. Here, rounding to the nearest minute seems reasonable, in which case “Regina arrives at 12:40” really means “Regina arrives within 0.5 minutes (30 seconds) of 12:40”. So we want to condition on the event \(\{40 - 0.5 < R < 40 + 0.5\}=\{|R - 40|<0.5\}\) rather than \(\{R = 40\}\). See the code below, where we condition on the event (abs(R - 40) < 0.5). Aside from the conditioning event, Donny’s code is fine, but note that we have changed the number of repetitions from 10000 to 1000; see the discussion below.
R, Y = RV(BivariateNormal(mean1 = 30, sd1 = 10, mean2 = 30, sd2 = 10, corr = 0.7))

y_given_Req40 = (Y | (abs(R - 40) < 0.5) ).sim(1000)

y_given_Req40.count_lt(40) / 1000
0.669
Warning

Be careful when conditioning on the value of a continuous random variable. Remember that the probability that a continuous random variable is equal to a particular value is 0. Mathematically, when we condition on \(\{X=x\}\) we are really conditioning on \(\{|X-x|<\epsilon\}\)—the event that the random variable \(X\) is within \(\epsilon\) of the value \(x\)—and seeing what happens in the idealized limit when \(\epsilon\) approaches 0. Practically, \(\epsilon\) represents the degree of precision which determines “close enough” in the problem context, e.g., \(\epsilon=0.01\) if “within 0.01” is close enough. When conditioning on a continuous random variable \(X\) in a simulation, never condition on \(\{X=x\}\); rather, condition on \(\{|X-x|<\epsilon\}\) where \(\epsilon\) represents a suitable degree of precision. What counts as “suitable” depends on the context and the scale of the \(X\) variable. For example, if our variable is household income (U.S. dollars) and we’re conditioning on household income “equal to” 100,000 dollars, then \(\epsilon\) of 1000 dollars might be reasonable.

The event \(\{|X-x|<\epsilon\}\) will have non-zero probability, but the probability might be small. Conditioning on low probability events introduces some computational challenges. Remember that the margin of error in using a relative frequency to approximate a probability is based on the number of independently simulated repetitions used to compute the relative frequency. When approximating a conditional probability with a conditional relative frequency, the margin of error is based only on the number of repetitions that satisfy the condition. Naively discarding repetitions that do not satisfy the condition can be horribly inefficient. For example, when conditioning on an event with probability 0.01 we would need to run about 1,000,000 repetitions to achieve 10,000 that satisfy the condition. There are more sophisticated and efficient methods (e.g., “Markov chain Monte Carlo (MCMC)” methods), but they are beyond the scope of this book. When conditioning on the value of a continuous random variable with | in Symbulate, be aware that you might need to change either the number of repetitions (but try not to go below 1000) or the degree of precision \(\epsilon\) in order for the simulation to run in a reasonable amount of time.

In the Bivariate Normal model of the meeting problem the event \(\{|R - 40| < 0.5\}\) has probability about 0.024. Running (Y | (abs(R - 40) < 0.5) ).sim(1000) requires about 41000 repetitions in the background to achieve the 1000 that satisfy the condition. The margin of error for approximating conditional probabilities is roughly \(1/\sqrt{1000} = 0.03\).

R, Y = RV(BivariateNormal(mean1 = 30, sd1 = 10, mean2 = 30, sd2 = 10, corr = 0.7))

( (R & Y) | (abs(R - 40) < 0.5) ).sim(10)
Index  Result
0      (40.085373008017754, 44.84662841573286)
1      (40.445444754019775, 25.685377354798874)
2      (39.6277842998428, 47.514050821851896)
3      (40.47040460235634, 35.18840927712478)
4      (40.44418133273754, 38.879897353811025)
5      (39.614371580386276, 27.080890344638735)
6      (39.812513011897096, 37.04000017025948)
7      (40.209052740481965, 30.770364885576097)
8      (40.32668932559085, 36.95611085143652)
...    ...
9      (39.545302737587164, 36.96979434636231)
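A plain-Python version of this rejection scheme (not Symbulate) makes the background work explicit; the Bivariate Normal spin is built from two independent standard Normal draws, and total counts how many spins were needed.

```python
import math
import random

def meeting_spin():
    # One (R, Y) pair: Bivariate Normal, means 30, sds 10, correlation 0.7
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    r = 30 + 10 * z1
    y = 30 + 10 * (0.7 * z1 + math.sqrt(1 - 0.7 ** 2) * z2)
    return r, y

kept, total = [], 0
while len(kept) < 1000:            # condition on |R - 40| < 0.5, not R == 40
    total += 1
    r, y = meeting_spin()
    if abs(r - 40) < 0.5:
        kept.append(y)

# Relative frequency of "Cady arrives before 12:40" among kept repetitions
p_hat = sum(y < 40 for y in kept) / len(kept)
```

The estimate p_hat approximates \(\textrm{P}(Y < 40 \mid |R - 40| < 0.5)\) with margin of error roughly \(1/\sqrt{1000} = 0.03\), while total shows how many repetitions were simulated and mostly discarded in the background.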

3.10.5 Conditional averages


Conditioning on the value of a random variable \(X\) in general changes the distribution of another random variable \(Y\). If a distribution changes, its summary characteristics like expected value and variance can change too.

Example 3.26 Roll a fair four-sided die twice. Let \(X\) be the sum of the two rolls, and let \(Y\) be the larger of the two rolls (or the common value if a tie). We found the joint and marginal distributions of \(X\) and \(Y\) in Example @ref(exm:dice-probspace), displayed in the table below.

\(p_{X, Y}(x, y)\)
\(x\) \ \(y\) 1 2 3 4 \(p_{X}(x)\)
2 1/16 0 0 0 1/16
3 0 2/16 0 0 2/16
4 0 1/16 2/16 0 3/16
5 0 0 2/16 2/16 4/16
6 0 0 1/16 2/16 3/16
7 0 0 0 2/16 2/16
8 0 0 0 1/16 1/16
\(p_Y(y)\) 1/16 3/16 5/16 7/16
  1. Find \(\textrm{E}(Y)\). How could you find a simulation-based approximation?
  2. Find \(\textrm{E}(Y|X=6)\). How could you find a simulation-based approximation?
  3. Find \(\textrm{E}(Y|X=5)\). How could you find a simulation-based approximation?
  4. Find \(\textrm{E}(Y|X=x)\) for each possible value \(x\) of \(X\).
  5. Find \(\textrm{E}(X|Y = 4)\). How could you find a simulation-based approximation?
  6. Find \(\textrm{E}(X|Y = y)\) for each possible value \(y\) of \(Y\).

Solution 3.26.

  1. \(\textrm{E}(Y) = 1(1/16) + 2(3/16) + 3(5/16) + 4(7/16) = 3.125\). Approximate this long run average value by simulating many values of \(Y\) and computing the average.

  2. The conditional pmf of \(Y\) given \(X=6\) places probability 2/3 on the value 4 and 1/3 on the value 3. Compute the expected value using this conditional distribution: \(\textrm{E}(Y|X=6) = 3(1/3) + 4(2/3) = 3.67\). This conditional long run average value could be approximated by simulating many \((X, Y)\) pairs from the joint distribution, discarding the pairs for which \(X\neq 6\), and computing the average value of the \(Y\) values for the remaining pairs.

  3. The conditional pmf of \(Y\) given \(X=5\) places probability 1/2 on the value 4 and 1/2 on the value 3. Compute the expected value using this conditional distribution: \(\textrm{E}(Y|X=5) = 3(1/2) + 4(1/2) = 3.5\). This conditional long run average value could be approximated by simulating many \((X, Y)\) pairs from the joint distribution, discarding the pairs for which \(X\neq 5\), and computing the average value of the \(Y\) values for the remaining pairs.

  4. Proceed as in the previous two parts to find \(\textrm{E}(Y|X=x)\) for each possible value \(x\) of \(X\).

    \(x\) \(\textrm{E}(Y|X=x)\)
    2 1(1) = 1
    3 2(1) = 2
    4 2(1/3) + 3(2/3) = 8/3
    5 3(1/2) + 4(1/2) = 3.5
    6 3(1/3) + 4(2/3) = 11/3
    7 4(1) = 4
    8 4(1) = 4
  5. The conditional pmf of \(X\) given \(Y=4\) places probability 2/7 on each of the values 5, 6, 7, and 1/7 on the value 8. Compute the expected value using this conditional distribution: \(\textrm{E}(X|Y=4) = 5(2/7) + 6(2/7) +7(2/7) + 8(1/7)= 6.29\). This conditional long run average value could be approximated by simulating many \((X, Y)\) pairs from the joint distribution, discarding the pairs for which \(Y\neq 4\), and computing the average value of the \(X\) values for the remaining pairs.

  6. Proceed as in the previous part.

    \(y\) \(\textrm{E}(X|Y=y)\)
    1 2(1) = 2
    2 3(2/3) + 4(1/3) = 10/3
    3 4(2/5) + 5(2/5) + 6(1/5) = 4.8
    4 5(2/7) + 6(2/7) +7(2/7) + 8(1/7) = 44/7
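The conditional means \(\textrm{E}(X|Y=y)\) in the table above can be verified by enumerating the 16 equally likely outcomes (plain Python, not Symbulate):

```python
from fractions import Fraction

outcomes = [(i, j) for i in range(1, 5) for j in range(1, 5)]  # 16 equally likely
pairs = [(i + j, max(i, j)) for (i, j) in outcomes]            # (X, Y) values

def cond_mean_X_given_Y(y):
    # Average the X values over only those outcomes with Y = y
    xs = [x for (x, y_val) in pairs if y_val == y]
    return Fraction(sum(xs), len(xs))

# cond_mean_X_given_Y(4) is Fraction(44, 7), matching the table
```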
U1, U2 = RV(DiscreteUniform(1, 4) ** 2)

X = U1 + U2

Y = (U1 & U2).apply(max)

y_given_Xeq6 = (Y | (X == 6) ).sim(3000)

y_given_Xeq6
Index  Result
0      3
1      3
2      3
3      3
4      3
5      3
6      3
7      3
8      4
...    ...
2999   4
y_given_Xeq6.tabulate()
Value  Frequency
3      1014
4      1986
Total  3000
y_given_Xeq6.mean()
3.662

The conditional expected value (a.k.a. conditional expectation a.k.a. conditional mean), of a random variable \(Y\) given the event \(\{X=x\}\), defined on a probability space with measure \(\textrm{P}\), is a number denoted \(\textrm{E}(Y|X=x)\) representing the probability-weighted average value of \(Y\), where the weights are determined by the conditional distribution of \(Y\) given \(X=x\).

\[\begin{align*} & \text{Discrete $X, Y$ with conditional pmf $p_{Y|X}$:} & \textrm{E}(Y|X=x) & = \sum_y y p_{Y|X}(y|x)\\ & \text{Continuous $X, Y$ with conditional pdf $f_{Y|X}$:} & \textrm{E}(Y|X=x) & =\int_{-\infty}^\infty y f_{Y|X}(y|x) dy \end{align*}\]

Remember, when conditioning on \(X=x\), \(x\) is treated as a fixed constant. The conditional expected value \(\textrm{E}(Y | X=x)\) is a number representing the mean of the conditional distribution of \(Y\) given \(X=x\). The conditional expected value \(\textrm{E}(Y | X=x)\) is the long run average value of \(Y\) over only those outcomes for which \(X=x\). To approximate \(\textrm{E}(Y|X = x)\), simulate many \((X, Y)\) pairs, discard the pairs for which \(X\neq x\), and average the \(Y\) values for the pairs that remain.

Example 3.27 Recall Example @ref(exm:uniform-sum-max-conditional). Consider the probability space corresponding to two spins of the Uniform(1, 4) spinner and let \(X\) be the sum of the two spins and \(Y\) the larger of the two spins (or the common value if a tie).

  1. Find \(\textrm{E}(X | Y = 3)\).
  2. Find \(\textrm{E}(X | Y = 4)\).
  3. Find \(\textrm{E}(X | Y = y)\) for each value \(y\) of \(Y\).
  4. Find \(\textrm{E}(Y | X = 3.5)\).
  5. Find \(\textrm{E}(Y | X = 6)\).
  6. Find \(\textrm{E}(Y | X = x)\) for each value \(x\) of \(X\).

Solution 3.27.

  1. Recall Figure @ref(fig:uniform-sum-max-conditional-plot). All the conditional distributions (slices) are Uniform, but over different ranges of possible values. Remember, the mean of any Uniform distribution is the midpoint of the possible range of values. The conditional distribution of \(X\) given \(Y=3\) is the Uniform(4, 6) distribution, which has mean 5. Therefore, \(\textrm{E}(X | Y = 3)=5\). As an integral, since \(f_{X|Y}(x|3) = 1/2, 4<x<6\), \[ \textrm{E}(X | Y = 3) = \int_4^6 x (1/2)dx = \frac{x^2}{4}\Bigg|_{x=4}^{x=6} = \frac{6^2}{4} - \frac{4^2}{4} = 5. \]
  2. The conditional distribution of \(X\) given \(Y=4\) is the Uniform(5, 8) distribution, which has mean 6.5. Therefore, \(\textrm{E}(X | Y = 4)=6.5\).
  3. For a given \(y\), the conditional distribution of \(X\) given \(Y=y\) is the Uniform(\(y+1\), \(2y\)) distribution, which has mean \(\frac{y+1+2y}{2}=1.5y + 0.5\). Therefore, \(\textrm{E}(X | Y = y)=1.5y + 0.5\). As an integral, since \(f_{X|Y}(x|y) = \frac{1}{y-1}, y+1 < x< 2y\), \[ \textrm{E}(X | Y = y) = \int_{y+1}^{2y} x \left(\frac{1}{y-1}\right)dx = \frac{x^2}{2(y-1)}\Bigg|_{x=y+1}^{x=2y} = \frac{(2y)^2}{2(y-1)} - \frac{(y+1)^2}{2(y-1)} = 1.5y + 0.5. \] In the above, \(y\) is treated like a fixed constant, and we are averaging over values of \(X\) by taking a \(dx\) integral. For a given value of \(y\) like \(y=3\), \(1.5y + 0.5\) is a number like \(1.5(3)+0.5=5\).
  4. The conditional distribution of \(Y\) given \(X=3.5\) is the Uniform(1.75, 2.5) distribution, which has mean 2.125. Therefore, \(\textrm{E}(Y | X = 3.5)=2.125\). As an integral, since \(f_{Y|X}(y|3.5) = 1/0.75, 1.75<y<2.5\), \[ \textrm{E}(Y | X = 3.5) = \int_{1.75}^{2.5} y (1/0.75)dy = \frac{y^2}{1.5}\Bigg|_{y=1.75}^{y=2.5} = \frac{2.5^2}{1.5} - \frac{1.75^2}{1.5} = 2.125. \]
  5. The conditional distribution of \(Y\) given \(X=6\) is the Uniform(3, 4) distribution, which has mean 3.5. Therefore, \(\textrm{E}(Y | X = 6)=3.5\).
  6. There are two general cases. If \(2<x<5\) then the conditional distribution of \(Y\) given \(X=x\) is Uniform(\(0.5x\), \(x-1\)) so \(\textrm{E}(Y|X = x) = \frac{0.5x + x - 1}{2} = 0.75x - 0.5\). If \(5<x<8\) then the conditional distribution of \(Y\) given \(X=x\) is Uniform(\(0.5x\), 4) so \(\textrm{E}(Y|X = x) = \frac{0.5x + 4}{2} = 0.25x +2\). The two cases can be put together as \[ \textrm{E}(Y | X = x) = 0.25x + 0.5\min(4, x-1). \] In the above, \(x\) is treated like a fixed constant, and we are averaging over values of \(Y\) by taking a \(dy\) integral. For a given value of \(x\) like \(x=3.5\), \(0.25x + 0.5\min(4, x-1)\) is a number like \(0.25(3.5) + 0.5\min(4, 3.5-1)=2.125\).
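As a quick numerical check of the combined formula (the function name below is ours, for illustration only):

```python
def cond_mean_max_given_sum(x):
    # E(Y | X = x) = 0.25 x + 0.5 min(4, x - 1), combining the two cases via min
    return 0.25 * x + 0.5 * min(4, x - 1)

print(cond_mean_max_given_sum(3.5))   # 2.125, matching part 4
print(cond_mean_max_given_sum(6))     # 3.5, matching part 5
```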

Remember that the probability that a continuous random variable is equal to a particular value is 0; that is, for continuous \(X\), \(\textrm{P}(X=x)=0\). When we condition on \(\{X=x\}\) we are really conditioning on \(\{|X-x|<\epsilon\}\) and seeing what happens in the idealized limit when \(\epsilon\to0\).

When simulating, never condition on \(\{X=x\}\) for a continuous random variable \(X\); rather, condition on \(\{|X-x|<\epsilon\}\) where \(\epsilon\) represents some suitable degree of precision (e.g. \(\epsilon=0.005\) if rounding to two decimal places).

To approximate \(\textrm{E}(Y|X = x)\) for continuous random variables, simulate many \((X, Y)\) pairs, discard the pairs for which \(X\) is not close to \(x\), and average the \(Y\) values for the pairs that remain.

U1, U2 = RV(Uniform(1, 4) ** 2)

X = U1 + U2

Y = (U1 & U2).apply(max)

x_given_Yeq3 = (X | (abs(Y - 3) < 0.05) ).sim(1000)

x_given_Yeq3
Index  Result
0      4.074822197398028
1      5.7363241947427674
2      4.530914050711006
3      4.548919345010007
4      5.925177295357556
5      5.538719841055289
6      4.852599597655303
7      5.459535140141803
8      6.017689876333142
...    ...
999    5.659608967672627
x_given_Yeq3.plot()
plt.show()

x_given_Yeq3.mean()
5.016335127873408

3.10.6 Exercises

3.11 Independence

For independent events, the multiplication rule simplifies

\[\begin{align*} \text{If $A$ and $B$ are independent then } && \textrm{P}(A \cap B) & = \textrm{P}(A)\textrm{P}(B)\\ \text{If independent then } && \text{Joint} & = \text{Product of Marginals} \end{align*}\]

Similarly, the joint distribution of independent random variables is the product of their marginal distributions. For this reason, independence is represented by the product * syntax in Symbulate.

For example, in the meeting problem, if Regina’s arrival follows a Uniform(0, 60) model, Cady’s follows a Normal(30, 10) model, and they arrive independently of each other, we can simulate pairs of arrivals as follows; note the use of *.

P = Uniform(0, 60) * Normal(30, 10)
P.sim(5)
Index  Result
0      (54.021611940693255, 18.38245685138002)
1      (57.1786283014659, 23.992095477459124)
2      (54.00608990735655, 36.58539625284115)
3      (24.07589065190313, 29.83889061094854)
4      (18.405546046911585, 22.491096155313848)

We can think of * as “spin each spinner independently”, but what * really does is create a joint probability space as the product of two marginal probability spaces.
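The factorization that * encodes can also be illustrated without Symbulate: simulate pairs whose components are generated independently in plain Python, and compare joint relative frequencies with products of marginal relative frequencies. All names below are ours, and the component distributions are arbitrary choices for illustration.

```python
import random
from collections import Counter

random.seed(2)
n = 200_000
# Each component is generated independently of the other
pairs = [(random.randint(1, 4), random.randint(1, 2)) for _ in range(n)]

joint = Counter(pairs)
marg_a = Counter(a for a, _ in pairs)
marg_b = Counter(b for _, b in pairs)

for (a, b) in sorted(joint):
    print((a, b),
          round(joint[a, b] / n, 3),                 # joint relative frequency
          round(marg_a[a] / n * marg_b[b] / n, 3))   # product of marginals
```

For each pair of values the two columns agree up to simulation margin of error, as independence requires.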

Warning

Be careful: * plays different roles for probability spaces and named distributions (like Uniform, Normal) than it does for random variables. Using * with probability spaces indicates that the spaces are independent. Likewise, using * with distributions of random variables, as in X, Y = RV(Distribution_X * Distribution_Y), indicates that the random variables are independent since their joint distribution is the product of their marginal distributions. In contrast, using * directly on random variables, as in X * Y, literally means multiplication. (Section 3.15 has further discussion on the difference between a random variable and its distribution.)

Note the two uses of * in the code below. In the P = ... line, * means that the two components of the pair will be simulated independently (e.g., “spin the Uniform(0, 60) spinner then spin the Normal(30, 10) spinner”), resulting in random variables R (first component) and Y (second component) that are independent. In (R * Y), * means multiply the values of R and Y.

P = Uniform(0, 60) * Normal(30, 10)
R, Y = RV(P)

(R & Y & (R * Y) ).sim(5)
Index  Result
0      (10.081664213522592, 29.07706488159332, 293.14520445103386)
1      (45.14772154846136, 18.001342942110348, 812.7196186487582)
2      (28.876375191212876, 37.5681789151643, 1084.8328296048971)
3      (14.543814081050762, 40.6460389208375, 591.1484331958137)
4      (36.49452333299631, 32.78823389294908, 1196.5909668539705)

We could have also coded the following, which we interpret as defining random variables R and Y whose joint distribution is the product of their marginal distributions, Uniform(0, 60) for R and Normal(30, 10) for Y, and thus R and Y are independent.

R, Y = RV(Uniform(0, 60) * Normal(30, 10))

(R & Y & (R * Y) ).sim(5)
Index  Result
0      (15.821813467071506, 30.328402734970737, 479.8503308269283)
1      (20.393580941499504, 45.18618840792188, 921.5081907348015)
2      (58.74884449732708, 35.13232987451028, 2063.983784626403)
3      (38.84281557308403, 34.908705390872065, 1355.952405392768)
4      (32.71831581835543, 25.7569654020773, 842.7245285476191)

Example 3.28 Donny Don’t writes the following code to simulate \(X\) and \(Y\) from Example 2.68, but it returns an error. Can you explain why? Think about it first, but if you need a hint, run the code to see the error. How could you fix the code so it runs properly?

X = RV(BoxModel([2, 3, 4, 5, 6, 7, 8], probs = [1/16, 2/16, 3/16, 4/16, 3/16, 2/16, 1/16]))
Y = RV(BoxModel([1, 2, 3, 4], probs = [1/16, 3/16, 5/16, 7/16]))

(X & Y).sim(10)

Solution 3.28. Donny’s code defines the correct marginal distributions for \(X\) and \(Y\), but to simulate \(X, Y\) pairs we need to know their joint distribution. Recall the moral of Example 2.68 and the discussion following it: marginal distributions alone are not enough to determine the joint distribution.

The error message is “Events must be defined on same probability space”, which in this case really should be “random variables must be defined on same probability space”, but even that is maybe not the most informative message. The idea is that we have not provided enough information to determine how to pair \(X\) and \(Y\) values when running the simulation. If the random variables were defined on the same probability space, the probability space outcome would be simulated first and then both \(X\) and \(Y\) would be computed for the same outcome, so there would be no problem when simulating pairs. One way to define the random variables on the same probability space is to specify their joint distribution and then simulate pairs directly from the joint distribution.

Recall from Example 2.68 that \(X\) and \(Y\) here are independent, so their joint distribution is the product of their marginal distributions. The code below achieves Donny’s goal, using the * syntax.

X, Y = RV(BoxModel([2, 3, 4, 5, 6, 7, 8], probs = [1/16, 2/16, 3/16, 4/16, 3/16, 2/16, 1/16]) * 
          BoxModel([1, 2, 3, 4], probs = [1/16, 3/16, 5/16, 7/16]))

(X & Y).sim(10)
Index  Result
0      (7, 4)
1      (5, 4)
2      (6, 4)
3      (5, 4)
4      (4, 3)
5      (3, 1)
6      (7, 4)
7      (2, 3)
8      (7, 4)
...    ...
9      (4, 2)

Example 3.29 Continuing Example 3.28, Donny says: “That error is annoying. If \(X\) and \(Y\) are independent then their marginal distributions determine their joint distribution, so my code should be fine.” How would you respond? (You might agree with Donny here. If so, imagine you have been assigned to argue against Donny.)

Solution 3.29. Donny is correct that if random variables are independent their marginal distributions determine their joint distribution. The issue is that independence is an assumption—see Section 2.7.3 —and Symbulate requires you to be explicit about all your assumptions. There are 3 main assumptions in this example:

  • \(X\) is assumed to have a particular marginal distribution
  • \(Y\) is assumed to have a particular marginal distribution
  • \(X\) and \(Y\) are assumed to be independent

Donny’s code has defined the first two assumptions, but not the third.

Suppose Donny had tried to run (X & Y).sim(10) without first defining X, hence not specifying the first assumption above. That would also result in an error, because there would not be enough information to run the simulation. If you don’t specify the distribution of a random variable—either directly or by defining it as a function on a probability space—Symbulate will not assume a distribution by default.

Likewise, Symbulate won’t assume random variables are independent unless you tell it to. You, like Donny, might want Symbulate to assume independence by default, but that’s just not the way it works. And it’s probably much safer that way because there are many real situations where independence is not a reasonable assumption, and assuming independence by default might lead to wildly inaccurate results.

Symbulate does offer a compromise, via the AssumeIndependent command, which allows code like Donny’s in Example 3.28 while also requiring explicit specification of independence. The following code would achieve Donny’s goal in Example 3.28.

X = RV(BoxModel([2, 3, 4, 5, 6, 7, 8], probs = [1/16, 2/16, 3/16, 4/16, 3/16, 2/16, 1/16]))
Y = RV(BoxModel([1, 2, 3, 4], probs = [1/16, 3/16, 5/16, 7/16]))

X, Y = AssumeIndependent(X, Y)

(X & Y).sim(10)
Index  Result
0      (3, 1)
1      (5, 4)
2      (5, 3)
3      (7, 3)
4      (7, 4)
5      (8, 4)
6      (7, 2)
7      (5, 3)
8      (5, 4)
...    ...
9      (3, 3)

While AssumeIndependent works, we discourage its use for two reasons. First, the code is clunky; in a sense the definition of X and Y with X = RV(...) and Y = RV(...) is incomplete until they are redefined with X, Y = AssumeIndependent(X, Y). More importantly, when dealing with multiple random variables it is their joint distribution that fully describes their probabilistic behavior. Therefore we encourage you to get in the habit of defining joint distributions (possibly using the “marginal then conditional” construction). In many situations random variables are not independent, so marginal distributions alone are not enough. In the special case of independence, define the joint distribution directly as the product of marginal distributions using * rather than using AssumeIndependent.

3.11.1 Exercises

3.12 Working with Symbulate

In this section we explore a little deeper some aspects of coding simulations with Symbulate.

3.12.1 Two “worlds” in Symbulate

Recall the dice rolling simulation in Section 3.3. Suppose we want to simulate realizations of the event \(A= \{Y < 3\}\). We saw previously that we can define the event (Y < 3) in Symbulate and simulate True/False realizations of it.

P = RV(DiscreteUniform(1, 4) ** 2)

Y = RV(P, max)

A = (Y < 3)

A.sim(10)
Index  Result
0      False
1      True
2      True
3      False
4      False
5      True
6      False
7      False
8      True
...    ...
9      True

Since event \(A\) is defined in terms of \(Y\), we can also first simulate values of Y, store the results as y, and then determine whether event \(A\) occurs for the simulated y values using (y < 3). (The results won’t match the above because we are making a new call to sim.)

y = Y.sim(10)

y
Index  Result
0      3
1      4
2      4
3      4
4      4
5      2
6      3
7      3
8      2
...    ...
9      4
(y < 3)
Index  Result
0      False
1      False
2      False
3      False
4      False
5      True
6      False
7      False
8      True
...    ...
9      False

These two methods illustrate the two “worlds” of Symbulate, which we call “random variable world” and “simulation world”. Operations like transformations can be performed in either world. Think of random variable world as the “before” world and simulation world as the “after” world, by which we mean before/after the sim step.

Most of the transformations we have seen so far happened in random variable world. For example, we have seen how to define the sum of two dice in random variable world in a few ways, e.g., via

P = BoxModel([1, 2, 3, 4], size = 2)

U = RV(P)

X = U.apply(sum)

The sum transformation is applied to define a new random variable X, before the sim step. We could then call, e.g., (U & X).sim(10).

We could also compute simulated values of the sum in simulation world as follows.

P = BoxModel([1, 2, 3, 4], size = 2)

U = RV(P)

u = U.sim(10)

u
Index  Result
0      (1, 3)
1      (3, 3)
2      (3, 3)
3      (4, 3)
4      (3, 2)
5      (1, 2)
6      (2, 1)
7      (3, 2)
8      (4, 2)
...    ...
9      (3, 2)
u.apply(sum)
Index  Result
0      4
1      6
2      6
3      7
4      5
5      3
6      3
7      5
8      6
...    ...
9      5

The above code will simulate all the pairs of rolls first, store them as u (which we can think of as a table), and then apply the sum to the simulated values (which adds a column to the table). That is, the sum transformation happens after the sim step.

While either world is “correct”, we generally take the random variable world approach. We do this mostly for consistency, but also to emphasize some of the probability concepts we’ll encounter. For example, the fact that a sum of random variables (defined on the same probability space) is also a random variable is a little more apparent in random variable world. However, it is sometimes more convenient to code in simulation world; for example, if complicated transformations are required.

Example 3.30 Donny Don’t writes the following code to simulate \(X\) and \(Y\) in the dice problem, but it returns an error. Can you explain why? Think about it first, but if you need a hint, run the code to see the error.

X = RV(BoxModel([1, 2, 3, 4], size = 2), sum)
Y = RV(BoxModel([1, 2, 3, 4], size = 2), max)

(X & Y).sim(10)

Solution 3.30. The problem is that there is one box model used to determine values of X and a separate box model for Y; we can’t simulate (X, Y) pairs of values if we’re using different boxes. The code above is basically saying to roll one fair four-sided die twice and compute the sum, then roll another four-sided die twice and compute the max, but we can only compute the pair of values (X & Y) for the same pair of rolls.

The error message is “Events must be defined on same probability space”, which in this case really should be “random variables must be defined on same probability space”, but the actual message at least reflects the spirit of the error.

Example 3.31 Donny Don’t tries again to simulate just \(X\) in the dice problem, this time in simulation world, but Donny’s code below still returns an error. Can you explain why? Think about it first, but if you need a hint, run the code to see the error.

P = BoxModel([1, 2, 3, 4], size = 2)
U1, U2 = RV(P)

u1 = U1.sim(10)
u2 = U2.sim(10)

x = u1 + u2

Solution 3.31. The error is basically the same as in Example 3.30. Running U1.sim(10) simulates one set of pairs of rolls to compute u1, but running u2 = U2.sim(10) simulates a separate set of pairs of rolls to compute u2. The error occurs when running u1+u2 because it only makes sense to add the first roll and second roll for the same pairs of rolls.

The error message is “Results objects must come from the same simulation”, which is maybe not the most informative message, but at least reflects the spirit of the error.

Example 3.32 Continuing Example 3.31, Donny says: “That error is annoying. If \(U_1\) and \(U_2\) are independent then I should be able to simulate them independently and then add them, so my code should be fine.” How would you respond? (You might agree with Donny here. If so, imagine you have been assigned to argue against Donny.)

Solution 3.32. This is essentially the same issue we discussed in Solution 3.29. Symbulate requires you to be explicit about all your assumptions. Symbulate won’t let you operate on multiple random variables unless you have been explicit about how they’re related. In particular, Symbulate won’t assume random variables are independent unless you explicitly tell it to when you define the RVs (e.g., by specifying their joint distribution as the product of their marginal distributions using *). You, like Donny, might want the ability to simulate individual independent random variables separately and then combine the results of the simulations, but that’s just not the way Symbulate works. And it’s probably much safer that way because there are many real situations where independence is not a reasonable assumption, and combining the results from independent simulations might lead to wildly inaccurate results.

When working in random variable world, it only makes sense to transform or simulate random variables defined on the same probability space. Likewise, in simulation world, it only makes sense to apply transformations to values generated in the same simulation, with all assumptions—including independence, if appropriate—clearly defined, via a single sim step.

The “simulation” step should be implemented in a single call to sim, and in that single step you should simulate realizations of all random variables of interest (e.g., with &). Afterwards you can select certain columns of the simulation results using brackets, and manage these columns in simulation world. For example, if we wanted to sum the rolls in simulation world, we could use the following code, which has a single call to sim and runs without error.

P = BoxModel([1, 2, 3, 4], size = 2)
U1, U2 = RV(P)

u1_and_u2 = (U1 & U2).sim(10)

u1 = u1_and_u2[0]
u2 = u1_and_u2[1]

x = u1 + u2
x
Index  Result
0      8
1      5
2      3
3      3
4      4
5      3
6      5
7      6
8      8
...    ...
9      5

3.12.2 Brief summary of Symbulate commands

We will see a variety of scenarios throughout the book, but many simulations using Symbulate can be carried out with relatively few commands. The following table lists the main Symbulate commands needed for a wide variety of situations.

Command Function
BoxModel Define a box model
ProbabilitySpace Define a custom probability space
RV Define random variables, vectors, or processes
RandomProcess Define a discrete or continuous time stochastic process
apply Apply transformations
[] (brackets) Access a component of a random vector, or a random process at a particular time
* (and **) Define independent probability spaces or distributions
AssumeIndependent Assume random variables or processes are independent
| (vertical bar) Condition on events
& Join multiple random variables into a random vector
sim Simulate outcomes, events, and random variables, vectors, and processes
tabulate Tabulate simulated values
plot Plot simulated values
filter (and relatives) Create subsets of simulated values (filter_eq, filter_lt, etc)
count (and relatives) Count simulated values which satisfy some criteria (count_eq, count_lt, etc)
Statistical summaries mean, median, sd, var, quantile, corr, cov, etc.
Common models Normal, BivariateNormal, and many others we will see

In addition to Symbulate commands, we can use basic Python commands to

  • define functions to define random variables
  • define for loops to investigate changing parameters
  • customize plots produced by the Symbulate plot command (add axis labels, legends, titles, etc)
  • summarize results of multiple simulations in tables and Matplotlib plots (e.g., for different values of problem parameters)

The next section and Section 3.13 will provide some examples of how we can use Symbulate together with Python (or R) code.

3.12.3 Working with Symbulate output in Python and R

Symbulate has many built in commands that facilitate the summarization of simulation output in basic plots, tables, and statistics. However, it is also possible to convert Symbulate output into a data frame which can be manipulated using (non-Symbulate) Python or R code. This section provides a brief introduction.

Let’s start with some Symbulate code and output (from the dice rolling simulation in Section 3.3).

P = BoxModel([1, 2, 3, 4], size = 2)

X = RV(P, sum)
Y = RV(P, max)

xy = (X & Y).sim(10000)

xy
Index  Result
0      (6, 4)
1      (6, 3)
2      (4, 3)
3      (7, 4)
4      (3, 2)
5      (4, 2)
6      (6, 4)
7      (6, 4)
8      (6, 4)
...    ...
9999   (4, 2)

pandas—conventionally imported as pd—is a popular Python package for data science. We can convert the output of a Symbulate sim to a pandas DataFrame.

import pandas as pd
xy_df = pd.DataFrame(xy)

xy_df
      0  1
0     6  4
1     6  3
2     4  3
3     7  4
4     3  2
...  .. ..
9995  4  3
9996  5  3
9997  4  3
9998  4  2
9999  4  2

[10000 rows x 2 columns]

By default the column names are just their index (0, 1, 2,…), but they can be renamed.

xy_df = xy_df.rename(columns={0: "x", 1: "y"})

xy_df
      x  y
0     6  4
1     6  3
2     4  3
3     7  4
4     3  2
...  .. ..
9995  4  3
9996  5  3
9997  4  3
9998  4  2
9999  4  2

[10000 rows x 2 columns]

Now we can work with the simulation output as we would any pandas DataFrame. Here are just a few examples.

xy_df["y"].value_counts(normalize = True).sort_index()
y
1    0.0631
2    0.1848
3    0.3163
4    0.4358
Name: proportion, dtype: float64
xy_df["x"].value_counts().sort_index().plot.bar()
plt.show()

pd.crosstab(xy_df["x"], xy_df["y"])
y    1     2     3     4
x                       
2  631     0     0     0
3    0  1236     0     0
4    0   612  1226     0
5    0     0  1314  1191
6    0     0   623  1226
7    0     0     0  1282
8    0     0     0   659

R is another popular software environment for statistical computing and graphics. Symbulate is a Python package, but there are a number of ways to interface between R and Python. Using an IDE such as Positron, you can toggle between R and Python (as well as other formats). The following provides examples of code cells you can include in a Quarto document that you can edit in Positron and run both Python and R code.

Using Quarto, we can first run a Symbulate simulation and convert the output to a pandas DataFrame (xy_df) using a Python code block.

```{python}
P = BoxModel([1, 2, 3, 4], size = 2)

X = RV(P, sum)
Y = RV(P, max)

xy = (X & Y).sim(10000)

xy_df = pd.DataFrame(xy)

xy_df = xy_df.rename(columns={0: "x", 1: "y"})
```

Now we can use the R package reticulate and access xy_df within an R code block (as py$xy_df).

```{r}

library(reticulate)
```
```{r}
py$xy_df |> head()
```
  x y
1 5 4
2 6 4
3 3 2
4 8 4
5 7 4
6 7 4

Now we have an R data frame that we can manipulate using commands from R packages. For example, we can use the R package ggplot2 to create visualizations of Symbulate output.

```{r}

library(ggplot2)
```
```{r}
xy_df = py$xy_df

N = nrow(xy_df)

ggplot(xy_df,
       aes(x = factor(x), y = factor(y))) +
  stat_bin_2d(aes(fill = after_stat(count / N))) +
  labs(x = "Sum of the two rolls",
       y = "Larger of the two rolls",
       fill = "Proportion of pairs",
       title = "10000 pairs of rolls of a fair four-sided die")
```

3.13 Case Study: Matching Problem

Dice rolling provides a simple scenario for us to introduce ideas, but it’s not very exciting. Now we’ll apply ideas from this chapter to investigate a more interesting problem: the matching problem. We’ll also see how we can combine Symbulate and Python code. Our analysis will illustrate sensitivity analysis, where we see how probabilities, distributions, and expected values change as we vary parameters or assumptions.

Consider the matching problem for a general \(n\): objects labeled \(1, 2, 3, \ldots, n\) are placed at random in spots labeled \(1, 2, 3, \ldots, n\) with spot 1 the correct spot for object 1, etc. Let the random variable \(X\) count the number of objects (out of \(n\)) that are put back in the correct spot; \(X\) is the number of “matches”. Let \(\textrm{P}\) denote the probability measure corresponding to the assumption that the objects are equally likely to be placed in any spot, so that the \(n!\) possible placements are equally likely.

We have previously considered the case \(n=4\) (Example 2.2, Example 2.22, Exercise 2.15). By enumerating the \(4! = 24\) possible outcomes (see Table 2.2 and Table 2.9) we found the distribution of \(X\) when \(n=4\), displayed in Table 3.14 and Figure 3.33. When \(n=4\) the probability of at least one match is 0.625, and the expected value of \(X\) is 1.

Table 3.14: Distribution of \(X\), the number of matches in the matching problem with \(n=4\) and uniformly random placement of objects in spots.
x P(X=x)
0 0.3750
1 0.3333
2 0.2500
4 0.0417
Figure 3.33: Distribution of \(X\), the number of matches in the matching problem with \(n=4\) and uniformly random placement of objects in spots.
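As a sanity check, the distribution in Table 3.14 can be reproduced exactly by enumerating all \(4! = 24\) permutations in plain Python (using itertools rather than Symbulate):

```python
from itertools import permutations
from collections import Counter
from fractions import Fraction

# Count fixed points ("matches") for each of the 24 permutations of 4 objects
counts = Counter(sum(obj == spot for spot, obj in enumerate(perm))
                 for perm in permutations(range(4)))

pmf = {x: Fraction(c, 24) for x, c in sorted(counts.items())}
print(dict(counts))                          # 9, 8, 6, 1 outcomes for x = 0, 1, 2, 4
print(1 - pmf[0])                            # P(at least one match) = 5/8 = 0.625
print(sum(x * p for x, p in pmf.items()))    # E(X) = 1
```

The exact probabilities 9/24, 8/24, 6/24, 1/24 match Table 3.14, as do \(\textrm{P}(X \ge 1) = 0.625\) and \(\textrm{E}(X) = 1\).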

When \(n=4\) we could enumerate the \(4!=24\) possible outcomes, but such a strategy is not feasible for a general \(n\). For example, if \(n=60\) then there are \(60! \approx 10^{82}\) possible outcomes, which is about the total number of atoms in the observable universe. Therefore we need other strategies to investigate the problem.
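The size of \(60!\) is easy to confirm with Python’s arbitrary-precision integers:

```python
import math

# 60! has 82 digits, so it is between 10^81 and 10^82
print(len(str(math.factorial(60))))
```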

We’ll use simulation to perform a sensitivity analysis to investigate how the distribution of \(X\), the probability of at least one match, and the expected value of \(X\) depend on \(n\).

Warning

This is not really a warning but before proceeding, stop to think: what do you expect? How do you expect the probability of at least one match to depend on \(n\)? Will the probability increase, decrease, or stay about the same as \(n\) gets larger? What about the expected value of the number of matches? It’s always a good idea to think about a problem and make some initial guesses before just jumping into calculations or simulations.

Example 3.33 For the matching problem for a given \(n\), describe in detail how you would use simulation to approximate:

  1. The distribution of \(X\)
  2. \(\textrm{P}(X \ge 1)\)
  3. \(\textrm{E}(X)\).

Solution 3.33. We could use a box model.

  • The box has \(n\) tickets, labeled \(1, 2, \ldots, n\), one for each object.
  • An outcome is simulated by selecting \(n\) tickets from the box without replacement and recording their order, e.g., (2, 1, 3, 4) if \(n=4\).
  • Let \(x\) be the number of matches for the simulated outcome, e.g., \(x=2\) for outcome (2, 1, 3, 4).
  • The above two steps consist of one repetition, yielding one realized value of the random variable \(X\).
  • Repeat these steps many times to obtain many simulated values of \(X\).
  1. Summarize the simulated values of \(X\) and their relative frequencies to approximate the distribution of \(X\).
  2. Count the number of repetitions on which \(X\ge 1\) and divide by the total number of repetitions to approximate \(\textrm{P}(X\ge 1)\), the probability of at least one match.
  3. We have seen that an expected value can be interpreted as a long run average value. To approximate \(\textrm{E}(X)\), compute the average of the many simulated \(X\) values—sum all simulated \(X\) values and divide by the number of simulated \(X\) values.
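The steps above can be sketched in plain Python (this is not the book’s Symbulate implementation, which follows below; the function name matching_sim_plain and the use of random.sample are our choices):

```python
import random

random.seed(3)  # fixed seed so the run is reproducible

def matching_sim_plain(n, reps=20_000):
    """Approximate P(X >= 1) and E(X) for the matching problem with n objects."""
    labels = list(range(n))
    sims = []
    for _ in range(reps):
        outcome = random.sample(labels, n)   # one random placement of the n objects
        # number of matches for this simulated outcome
        sims.append(sum(o == l for o, l in zip(outcome, labels)))
    p_at_least_one = sum(x >= 1 for x in sims) / reps   # relative frequency of {X >= 1}
    mean_matches = sum(sims) / reps                     # average of simulated X values
    return p_at_least_one, mean_matches

print(matching_sim_plain(4))   # roughly (0.625, 1)
```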

We’ll start by coding a simulation for \(n=4\) so we can compare our simulation results to the analytical results to check that our simulation process is working correctly. Since Python uses zero-based indexing, we label the objects \(0, 1, \ldots, n-1\).

n = 4

labels = list(range(n)) # list of labels [0, ..., n-1]
labels
[0, 1, 2, 3]

Now we define the box model and simulate a few outcomes. Note that replace = False.

P = BoxModel(labels, size = n, replace = False)

P.sim(5)
Index  Result
0      (2, 0, 3, 1)
1      (2, 0, 1, 3)
2      (2, 1, 3, 0)
3      (3, 0, 2, 1)
4      (3, 2, 1, 0)

We simulate many outcomes to check that they are roughly equally likely.

P.sim(24000).tabulate()
Outcome       Frequency
(0, 1, 2, 3)  1002
(0, 1, 3, 2)  1027
(0, 2, 1, 3)  1029
(0, 2, 3, 1)  972
(0, 3, 1, 2)  984
(0, 3, 2, 1)  1001
(1, 0, 2, 3)  994
(1, 0, 3, 2)  975
(1, 2, 0, 3)  992
(1, 2, 3, 0)  974
(1, 3, 0, 2)  944
(1, 3, 2, 0)  982
(2, 0, 1, 3)  1005
(2, 0, 3, 1)  1054
(2, 1, 0, 3)  992
(2, 1, 3, 0)  1019
(2, 3, 0, 1)  1037
(2, 3, 1, 0)  1013
(3, 0, 1, 2)  998
...           ...
(3, 2, 1, 0)  1014
Total         24000

Remember that a random variable \(X\) is a function whose inputs are the sample space outcomes. In this example the function “counts matches”, so we would like to define \(X\) as X = RV(P, count_matches). Unfortunately, such a function isn’t built in like sum or max, but we can write a custom count_matches Python function. The count_matches function below starts a counter at 0 and then goes spot-by-spot through each spot in the outcome and increments the counter by 1 any time there is a match (along the lines of the discussion in Section 2.3.4). Don’t worry too much about the Python syntax yet. What’s important is that we have defined a function that we can use to define a random variable.

def count_matches(outcome):
    count = 0
    for i in range(0, n, 1):
        if outcome[i] == labels[i]:
            count += 1
    return count
  
outcome = (1, 0, 2, 3) # an example outcome, with 2 matches
count_matches(outcome) # the function evaluated for the example outcome
2
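The same function can also be written more compactly with a comprehension; the name count_matches_short below is ours, and the behavior is identical to count_matches, so this is purely a matter of style.

```python
n = 4
labels = list(range(n))

def count_matches_short(outcome):
    # count positions where the placed object matches its spot's label
    return sum(outcome[i] == labels[i] for i in range(n))

print(count_matches_short((1, 0, 2, 3)))   # 2
```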

Now we can use the function count_matches to define a Symbulate RV just like we have used sum or max.

X = RV(P, count_matches)

X((1, 0, 2, 3)) # evaluate X for the sample outcome
Warning: Calling an RV as a function simply applies the function that defines the RV to the input, regardless of whether that input is a possible outcome in the underlying probability space.
2

We can simulate many values of \(X\) and use the simulated values to approximate the distribution of \(X\), the probability of at least one match, and the expected value of \(X\).

x = X.sim(10000)

x.tabulate(normalize = True)
Value  Relative Frequency
0      0.3749
1      0.3338
2      0.2513
4      0.04
Total  1.0
x.plot()
plt.show()

x.count_geq(1) / x.count()
0.6251
x.sum() / x.count()
0.9964

The simulated distribution of \(X\) is close to the true distribution in Table 3.14 when \(n=4\); the simulated relative frequencies are within the margin of error (about 0.01-0.02 for 10000 simulated values) of the true probabilities. We also see that the average of the simulated \(X\) values is around the theoretical expected value of 1.

It appears that our simulation is working properly for \(n=4\). To investigate a different value of \(n\), we simply need to revise the line n = 4. Because we want to investigate many values of \(n\), we wrap all the above code in a Python function matching_sim which takes n as an input40 and outputs our objects of interest: a plot of the distribution of simulated \(X\) values, the approximation of \(\textrm{P}(X \ge 1)\), and the approximation of \(\textrm{E}(X)\).


def matching_sim(n):
  
    labels = list(range(n))
    
    def count_matches(omega):
        count = 0
        for i in range(0, n, 1):
            if omega[i] == labels[i]:
                count += 1
        return count
    
    P = BoxModel(labels, size = n, replace = False)
    X = RV(P, count_matches)
    
    x = X.sim(10000)
    
    plt.figure()
    x.plot('impulse')
    plt.show()
    
    return x.count_geq(1) / x.count(), x.sum() / x.count()

For example, for \(n=4\) we simply call

matching_sim(4)
(0.6241, 0.9958)

Now we can easily investigate different values of \(n\). For example, for \(n=10\) we see that the approximate probability of at least one match is around 0.63 and the approximate expected value of the number of matches is around 1.

matching_sim(10)
(0.6294, 0.9892)
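If Symbulate isn’t available, the same simulation can be sketched in plain Python; here random.shuffle plays the role of BoxModel with replace = False, since shuffling produces a uniformly random permutation. (The function name, reps, and seed below are our own choices, not the book’s code.)

```python
import random

def matching_sim_plain(n, reps=10000, seed=12345):
    """Approximate P(X >= 1) and E(X) for the matching problem with n spots."""
    rng = random.Random(seed)
    labels = list(range(n))
    at_least_one = 0
    total_matches = 0
    for _ in range(reps):
        outcome = labels[:]
        rng.shuffle(outcome)  # uniformly random placement of objects in spots
        matches = sum(outcome[i] == labels[i] for i in range(n))
        at_least_one += (matches >= 1)
        total_matches += matches
    return at_least_one / reps, total_matches / reps

print(matching_sim_plain(10))  # approximately (0.63, 1.0)
```

As with the Symbulate version, the approximations should land within the simulation margin of error of 0.632 and 1.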

We can use a for loop to automate the process of changing the value of \(n\), running the simulation, and recording the results. If ns is the list of \(n\) values of interest we basically just need to run

for n in ns:
    matching_sim(n)

In Python we can also use list comprehension

[matching_sim(n) for n in ns]

The table below summarizes the simulation results for \(n=4, \ldots, 10\). The code first defines the values of \(n\) and then runs the simulation for each value via list comprehension. We have used the Python package tabulate to add a little formatting to the table. (Don’t confuse this with the Symbulate tabulate method!) Note that we have temporarily redefined matching_sim to remove the lines that produced the plot, but we have not displayed the revised code here. (We will bring the plot back soon.)

from tabulate import tabulate 
ns = list(range(4, 11, 1))

results = [matching_sim(n) for n in ns]

print(tabulate({'n': ns,
                'Approximate P(X >= 1), Approximate E(X)': results},
               headers = 'keys', floatfmt=".3f"))
  n  Approximate P(X >= 1), Approximate E(X)
---  -----------------------------------------
  4  (0.6281, 1.0084)
  5  (0.6194, 0.9721)
  6  (0.629, 0.9838)
  7  (0.6389, 1.0096)
  8  (0.6287, 1.0015)
  9  (0.6344, 1.0015)
 10  (0.633, 1.0074)

Stop and look at the table; what do you notice? How do the probability of at least one match and the expected value of the number of matches depend on \(n\)? They don’t! Well, maybe they do, but they don’t appear to change very much with \(n\) after we take into account simulation margin of error41 of about 0.01-0.02. It appears that regardless of the value of \(n\), the probability of at least one match is around 0.63 and the expected value of the number of matches is around 1.

If we’re interested in more values of \(n\), we just repeat the same process with a longer list of ns. The code below uses the Python package Matplotlib to create a plot of the probability of at least one match and the expected value of the number of matches versus \(n\). While there is some natural simulation variability, we see that the probability of at least one match (about 0.63) and the expected value of the number of matches (about 1) essentially do not depend on \(n\)!

from matplotlib import pyplot as plt
ns = list(range(4, 101, 1))

results = [matching_sim(n) for n in ns]

plt.figure()
plt.plot(ns, results)
plt.legend(['Approximate P(X >= 1)', 'Approximate E(X)'])
plt.xlabel('n')
plt.ylabel('value')
plt.show()

What about the distribution of \(X\)? This Colab notebook contains code for investigating how the distribution of \(X\) depends on \(n\). The basic simulation code is identical to what we have already seen. The notebook adds a few lines to create a Jupyter widget, which produces an interactive plot (with some additional formatting) of the distribution; you can change the slider to see how the distribution of \(X\) changes with \(n\). Take a few minutes to play with the slider; what do you see?

You should see that unless \(n\) is really small (like 5 or less) the distribution of \(X\) is basically the same for any value of \(n\)! We will see soon that the number of matches follows, approximately, a “Poisson(1)” distribution, regardless of the value of \(n\) (unless \(n\) is really small). In particular, the probability of 0 matches is approximately 0.37, which is also approximately the probability of exactly 1 match.

Table 3.15: Approximate distribution of \(X\), the number of matches in the matching problem for general \(n\) and uniformly random placement of objects in spots.
x P(X=x)
0 0.3679
1 0.3679
2 0.1839
3 0.0613
4 0.0153
5 0.0031
6 0.0005
7 0.0001
Figure 3.34: Approximate distribution of \(X\), the number of matches in the matching problem for general \(n\) and uniformly random placement of objects in spots.
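The probabilities in Table 3.15 are the Poisson(1) probabilities \(e^{-1}/x!\) rounded to four decimal places, which can be verified directly in plain Python (no Symbulate needed):

```python
from math import exp, factorial

# Poisson(1) probabilities: P(X = x) = e^(-1) / x!
for x in range(8):
    print(x, round(exp(-1) / factorial(x), 4))
```

The printed values reproduce the table, including the equal probabilities (0.3679) for 0 matches and exactly 1 match.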

Summarizing, our sensitivity analysis42 of the matching problem reveals that, unless \(n\) is really small,

  • the probability of at least one match is approximately 0.632
  • the expected value—that is, long run average—of the number of matches is approximately 1
  • the distribution of the number of matches is approximately the “Poisson(1)” distribution.

We will investigate these observations further later. For now, marvel at the fact that no matter if there are 10 or 100 or 1000 people in your next Secret Santa gift exchange, it’s roughly 2 to 1 odds in favor of somebody drawing their own name!

3.14 Simulating from distributions

The distribution of a random variable specifies the long run pattern of variation of values of the random variable over many repetitions of the underlying random phenomenon, or the relative degree of likelihood of possible values. The distribution of a random variable \(X\) can be approximated by

  • simulating an outcome of the underlying random phenomenon \(\omega\) according to the assumptions encoded in the probability measure \(\textrm{P}\)
  • observing the value of the random variable for that outcome \(X(\omega)\)
  • repeating this process many times
  • then computing relative frequencies involving the simulated values of the random variable to approximate probabilities of events involving the random variable, e.g., \(\textrm{P}(X\le x)\).

We have carried out this process for several examples, including the dice rolling example in Section 3.1 and Section 3.3, where each repetition involved simulating a pair of rolls (outcome \(\omega\)) and then finding the sum (\(X(\omega)\)) and max (\(Y(\omega)\)).

Now we’ll discuss another method for simulating values of a random variable.

Example 3.34 Recall Example 2.41, Table 2.14, and Table 2.15.

  1. Construct a spinner to represent the marginal distribution of \(X\).
  2. How could you use the spinner from the previous part to simulate a value of \(X\)?
  3. Construct a spinner to represent the marginal distribution of \(Y\).
  4. How could you use the spinner from the previous part to simulate a value of \(Y\)?
  5. Donny Don’t says: “Great! I can simulate an \((X, Y)\) pair just by spinning the spinner in Figure 3.35 (a) to generate \(X\) and the one in Figure 3.35 (b) to generate \(Y\).” Is Donny correct? If not, can you help him see why not?

Solution 3.34.

  1. The spinner in Figure 3.35 (a) represents the marginal distribution of \(X\) in Table 2.14.
  2. Just spin it! The spinner returns the possible values of \(X\) according to the proper probabilities. If we’re interested in simulating the sum of two rolls of a fair four-sided die, we don’t have to roll the die; we can just spin the \(X\) spinner once.
  3. The spinner in Figure 3.35 (b) represents the marginal distribution of \(Y\) in Table 2.15.
  4. Just spin it! The spinner returns the possible values of \(Y\) according to the proper probabilities. If we’re interested in simulating the larger of two rolls of a fair four-sided die, we don’t have to roll the die; we can just spin the \(Y\) spinner once.
  5. Donny is incorrect. Yes, spinning the \(X\) spinner in Figure 3.35 (a) will generate values of \(X\) according to the proper marginal distribution, and similarly with Figure 3.35 (b) and \(Y\). However, spinning each of the spinners will not produce \((X, Y)\) pairs with the correct joint distribution. For example, Donny’s method could produce \(X=2\) and \(Y=4\), which is not a possible \((X, Y)\) pair. Donny’s method treats the values of \(X\) and \(Y\) as if they were independent; the result of the \(X\) spin would not change what could happen with the \(Y\) spin (since the spins are physically independent). However, the \(X\) and \(Y\) values are related. For example, if \(X=2\) then \(Y\) must be 1; if \(X=4\) then \(Y\) must be 2 or 3; etc. Later we’ll see a spinner which does properly simulate \((X, Y)\) pairs.
(a) The marginal distribution of \(X\); see Table 2.14
(b) The marginal distribution of \(Y\); see Table 2.15
Figure 3.35: Spinner representing the marginal distributions of \(X\) and \(Y\) in Example 2.41
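Donny’s error is easy to expose with a quick plain-Python sketch (illustrative code, not the book’s; the weights encode the marginal probabilities from Table 2.14 and Table 2.15 in sixteenths):

```python
import random

rng = random.Random(1)
x_vals, x_wts = [2, 3, 4, 5, 6, 7, 8], [1, 2, 3, 4, 3, 2, 1]  # marginal of X (sum)
y_vals, y_wts = [1, 2, 3, 4], [1, 3, 5, 7]                    # marginal of Y (max)

# Donny's (incorrect) method: spin the two marginal spinners independently
pairs = [(rng.choices(x_vals, x_wts)[0], rng.choices(y_vals, y_wts)[0])
         for _ in range(10000)]

# If the sum of two rolls is 2, the max must be 1, so (X, Y) = (2, 4) is
# impossible -- yet Donny's method produces it about (1/16)(7/16) of the time
impossible = sum(1 for x, y in pairs if (x, y) == (2, 4))
print(impossible)  # > 0, exposing the error
```

Each spinner alone produces the correct marginal distribution, but the independent spins do not respect the joint distribution.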

In principle, there are always two ways of simulating a value \(x\) of a random variable \(X\).

  1. Simulate from the probability space. Simulate an outcome \(\omega\) from the underlying probability space and set \(x = X(\omega)\).
  2. Simulate from the distribution. Simulate a value \(x\) directly from the distribution of \(X\).

The “simulate from the distribution” method corresponds to constructing a spinner representing the distribution of \(X\) and spinning it once to generate \(x\).

Example 3.35 Donny Don’t says: “The ‘simulate from the distribution’ method is dumb. In order to use it, I first need to know the distribution. But the whole point of simulation is to approximate a distribution. If I know the distribution, why do I need to approximate it?” How would you respond to Donny? Hint: it might help to consider the brown bag analogy in Section 2.3.2. It might also help to consider how we use simulation in the meeting problem in this chapter.

Solution 3.35.

This method does require that the distribution of \(X\) is known. However, as we will see in many examples, it is common to specify the distribution of a random variable directly without defining the underlying probability space.

Any marginal distribution can be represented by a single spinner, as the following example illustrates.

Example 3.36 In Section 3.3 we saw the Symbulate code for the “simulate from the probability space” method in the dice rolling example. Now we consider the “simulate from the distribution” method.

  1. Write the Symbulate code to define \(X\) by specifying its marginal distribution.
  2. Write the Symbulate code to define \(Y\) by specifying its marginal distribution.

Solution 3.36. We simulate a value of \(X\) from its marginal distribution by spinning the spinner in Figure 3.35 (a). Similarly, we simulate a value of \(Y\) from its marginal distribution by spinning the spinner in Figure 3.35 (b). We can define a BoxModel corresponding to each of these spinners, and then define a random variable through the identity function. Essentially, we define the random variable by specifying its distribution, rather than specifying the underlying probability space. Note that the default size argument in BoxModel is size = 1, so we have omitted it.

Careful: the method below will not work if we want to simulate \((X, Y)\) pairs.

X = RV(BoxModel([2, 3, 4, 5, 6, 7, 8], probs = [1 / 16, 2 / 16, 3 / 16, 4 / 16, 3 / 16, 2 / 16, 1 / 16]))

x = X.sim(10000)
x.tabulate(normalize = True)
Value Relative Frequency
2     0.0671
3     0.1312
4     0.1848
5     0.24
6     0.1858
7     0.129
8     0.0621
Total 1.0
x.plot()

Similarly we can define \(Y\) as

Y = RV(BoxModel([1, 2, 3, 4], probs = [1 / 16, 3 / 16, 5 / 16, 7 / 16]))

BoxModel with the probs option can be used to define general discrete distributions (when there are finitely many possible values). Many commonly encountered distributions have special names, and in Symbulate we can define a random variable to have a specific named distribution through code of the form RV(Distribution(parameters)).

In ?exm-binomial-capture we used the “simulate from the probability space” method to simulate values of \(X\). We can also “simulate from the distribution” in Symbulate by defining X = RV(Binomial(5, 0.25)); read this as “\(X\) is a random variable with a Binomial(5, 0.25) distribution”. This code basically says that values of \(X\) will be simulated using the spinner in ?fig-binomial-capture-2.

X = RV(Binomial(5, 0.25))

X.sim(10)
Index Result
0     0
1     0
2     1
3     2
4     2
5     0
6     1
7     2
8     1
...   ...
9     1
X.sim(10000).tabulate(normalize = True)
Value Relative Frequency
0     0.2328
1     0.3974
2     0.2593
3     0.0954
4     0.0141
5     0.001
Total 1.0
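The simulated relative frequencies are close to the exact Binomial(5, 0.25) probabilities, \(\binom{5}{k}(0.25)^k(0.75)^{5-k}\), which can be computed in plain Python with math.comb (a quick check, separate from the Symbulate code):

```python
from math import comb

# Exact Binomial(5, 0.25) probabilities: C(5, k) * 0.25^k * 0.75^(5 - k)
n, p = 5, 0.25
for k in range(n + 1):
    print(k, round(comb(n, k) * p**k * (1 - p)**(n - k), 4))
```

Each exact probability (0.2373, 0.3955, 0.2637, 0.0879, 0.0146, 0.0010) is within the simulation margin of error of the corresponding relative frequency above.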

3.15 Do not confuse a random variable with its distribution

We close this chapter with a warning.

Warning

Do not confuse a random variable with its distribution.

A random variable measures a numerical quantity which depends on the outcome of a random phenomenon. The distribution of a random variable specifies the long run pattern of variation of values of the random variable over many repetitions of the underlying random phenomenon43. The distribution of a random variable can be approximated by simulating an outcome of the random process, observing the value of the random variable for that outcome, repeating this process many times, and summarizing the results. But a random variable is not its distribution.

Recall the brown bag analogy in Section 2.3.2. The probability space corresponds to the random selection of fruits to put in the bag. The random variable is weight. The distribution of weight can be obtained by randomly selecting fruits to put in the bag, weighing the bag, and then repeating this process many times to observe many weights, summarize the values, and describe the pattern of variability (e.g., 10% of bags have weights less than 5 pounds, 75% of bags have weights less than 20 pounds, etc). But the random variable is like the scale itself.

A distribution is determined by:

  • The underlying probability measure \(\textrm{P}\), which represents all the assumptions about the random phenomenon.
  • The random variable \(X\) itself, that is, the function which maps sample space outcomes to numbers.

Changing either the probability measure or the random variable itself can change the distribution of the random variable. For example, consider the sample space of two rolls of a fair four-sided die. In each of the following scenarios, the random variable \(X\) has a different distribution.

  1. The die is fair and \(X\) is the sum of the two rolls
  2. The die is fair and \(X\) is the larger of the two rolls
  3. The die is weighted to land on 1 with probability 0.1 and 4 with probability 0.4 and \(X\) is the sum of the two rolls.

In particular,

  1. \(X\) takes the value 2 with probability \(1/16\)
  2. \(X\) takes the value 2 with probability \(3/16\)
  3. \(X\) takes the value 2 with probability 0.01.

In (1) and (2), the probability measure is the same (fair die) but the function defining the random variable is different (sum versus max). In (1) and (3), the function defining the random variable is the same, and the sample space of outcomes is the same, but the probability measure is different.

We often specify the distribution of a random variable directly without explicit mention of the underlying probability space or function defining the random variable. For example, in the meeting problem we might assume that arrival times follow a Uniform(0, 60) or a Normal(30, 10) distribution. In situations like this, you can think of the probability space as being the distribution of the random variable and the function defining the random variable to be the identity function. In other words, we construct a spinner to represent the distribution of the random variable and spin it to simulate a value of the random variable.

Example 3.37 Donny Don’t is thoroughly confused about the distinction between a random variable and its distribution. Help him understand by providing a simple concrete example of two different random variables \(X\) and \(Y\) that have the same distribution. Can you think of \(X\) and \(Y\) for which \(\textrm{P}(X = Y) = 0\)? How about a discrete example and a continuous example?

Solution 3.37. Flip a fair coin 3 times and let \(X\) be the number of heads and \(Y\) be the number of tails. Then \(X\) and \(Y\) have the same distribution, because they have the same long run pattern of variability. Each variable takes values 0, 1, 2, 3 with probability 1/8, 3/8, 3/8, 1/8. But they are not the same random variable; they are measuring different things. If the outcome is HHT then \(X\) is 2 but \(Y\) is 1. In this case \(\textrm{P}(X = Y)=0\); in an odd number of flips it’s not possible to have the same number of heads and tails on any single outcome.
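The coin-flip example can be verified by enumerating all eight equally likely outcomes in plain Python (an illustrative sketch):

```python
from itertools import product
from collections import Counter

# All 8 equally likely outcomes of 3 flips of a fair coin
outcomes = list(product('HT', repeat=3))
X = {w: w.count('H') for w in outcomes}  # number of heads
Y = {w: w.count('T') for w in outcomes}  # number of tails

# Same distribution: values 0, 1, 2, 3 occur on 1, 3, 3, 1 of the 8 outcomes...
assert Counter(X.values()) == Counter(Y.values()) == Counter({0: 1, 1: 3, 2: 3, 3: 1})
# ...but X and Y are never equal on the same outcome, so P(X = Y) = 0
assert all(X[w] != Y[w] for w in outcomes)
```

The two Counters agree, confirming the common 1/8, 3/8, 3/8, 1/8 distribution, while the final assertion confirms that the random variables themselves never coincide.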

In some cases of the meeting time problem we assumed the distribution of Regina’s arrival time \(R\) is Uniform(0, 60) and the distribution of Cady’s arrival time \(Y\) is Uniform(0, 60). So \(R\) and \(Y\) have the same distribution. But these are two different random variables; one measures Regina’s arrival time and one measures Cady’s. If Regina and Cady met every day for a year, then the day-to-day pattern of Regina’s arrival times would look like the day-to-day pattern of Cady’s arrival times. But on any given day, their arrival times would not be the same, since \(R\) and \(Y\) are continuous random variables so \(\textrm{P}(R = Y) = 0\).

A distribution, like a spinner, is a blueprint for simulating values of the random variable. If two random variables have the same distribution, you could use the same spinner to simulate values of either random variable. But a distribution is not the random variable itself. (In other words, “the map is not the territory.”)

Two random variables can have the same (long run) distribution, even if the values of the two random variables are never equal on any particular repetition (outcome). If \(X\) and \(Y\) have the same distribution, then the spinner used to simulate \(X\) values can also be used to simulate \(Y\) values; in the long run the patterns would be the same.

At the other extreme, two random variables \(X\) and \(Y\) are the same random variable only if for every outcome of the random phenomenon the resulting values of \(X\) and \(Y\) are the same. That is, \(X\) and \(Y\) are the same random variable only if they are the same function: \(X(\omega)=Y(\omega)\) for all \(\omega\in\Omega\). It is possible to have two random variables for which \(\textrm{P}(X=Y)\) is large, but \(X\) and \(Y\) have different distributions.

Many commonly encountered distributions have special names. For example, the distribution of \(X\), the number of heads in 3 flips of a fair coin, is called the “Binomial(3, 0.5)” distribution. If a random variable has a Binomial(3, 0.5) distribution then it takes the possible values 0, 1, 2, 3, with respective probability 1/8, 3/8, 3/8, 1/8. The random variable in each of the following situations has a Binomial(3, 0.5) distribution.

  • \(Y\) is the number of Tails in three flips of a fair coin
  • \(Z\) is the number of even numbers rolled in three rolls of a fair six-sided die
  • \(W\) is the number of female babies in a random sample of three births at a hospital (assuming boys and girls are equally likely44)

Each of these situations involves a different sample space of outcomes (coins, dice, births) with a random variable which counts different things (Heads, Tails, evens, females). But all the scenarios have some general features in common:

  • There are 3 “trials” (3 flips, 3 rolls, 3 babies)
  • Each trial can be classified as “success”45 (Tails, even, female) or “failure”.
  • Each trial is equally likely to result in success or not (fair coin, fair die, assuming boys and girls are equally likely)
  • The trials are independent. For coins and dice the trials are physically independent. For births independence follows from random selection from a large population.
  • The random variable counts the number of successes in the 3 trials (number of T, number of even rolls, number of female babies).

These examples illustrate that knowledge that a random variable has a specific distribution (e.g., Binomial(3, 0.5)) does not necessarily convey any information about the underlying outcomes or random variable (function) being measured. (We will study Binomial distributions in more detail later.)

The scenarios involving \(W, X, Y, Z\) illustrate that two random variables do not have to be defined on the same sample space in order to determine if they have the same distribution. This is in contrast to computing quantities like \(\textrm{P}(X=Y)\): \(\{X=Y\}\) is an event which cannot be investigated unless \(X\) and \(Y\) are defined for the same outcomes. For example, you could not estimate the probability that a student has the same score on both SAT Math and Reading exams unless you measured pairs of scores for each student in a sample. However, you could collect SAT Math scores for one set of students to estimate the marginal distribution of Math scores, and collect Reading scores for a separate set of students to estimate the marginal distribution of Reading scores.

A random variable can be defined explicitly as a function on a probability space, or implicitly through its distribution. The distribution of a random variable is often assumed or specified directly, without mention of the underlying probability space or the function defining the random variable. For example, a problem might state “let \(Y\) have a Binomial(3, 0.5) distribution” or “let \(Y\) have a Normal(30, 10) distribution”. But remember, such statements do not necessarily convey any information about the underlying sample space outcomes or random variable (function) being measured. In Symbulate the RV command can also be used to define an RV implicitly via its distribution. A definition like X = RV(Binomial(3, 0.5)) effectively defines a random variable X on an unspecified probability space via an unspecified function.

W = RV(Binomial(3, 0.5))

plt.figure()
W.sim(10000).plot()
plt.show()

Example 3.38 Suppose that \(X\), \(Y\), and \(Z\) all have the same distribution. Donny Don’t says

  1. The pair \((X, Y)\) has the same joint distribution as the pair \((X, Z)\).
  2. \(X+Y\) has the same distribution as \(X+Z\).
  3. \(X+Y\) has the same distribution as \(X+X=2X\).

Determine if each of Donny’s statements is correct. If not, explain why not using a simple example.

Solution 3.38. First of all, Donny’s statements wouldn’t even make sense unless the random variables were all defined on the same probability space. For example, if \(X\) is SAT Math score and \(Y\) is SAT Reading score it doesn’t make sense to consider \(X+Y\) unless \((X, Y)\) pairs are measured for the same students. But even assuming the random variables are defined on the same probability space, we can find counterexamples to Donny’s statements.

As just one example, flip a fair coin 4 times and let

  • \(X\) be the number of heads in flips 1 through 3
  • \(Y\) be the number of tails in flips 1 through 3
  • \(Z\) be the number of heads in flips 2 through 4.
  1. The joint distribution of \((X, Y)\) is not the same as the joint distribution of \((X, Z)\). For example, \((X, Y)\) takes the pair \((3, 3)\) with probability 0, but \((X, Z)\) takes the pair \((3, 3)\) with nonzero probability (1/16).
  2. The distribution of \(X+Y\) is not the same as the distribution of \(X+Z\); \(X+Y\) is 3 with probability 1, but the probability that \(X+Z\) is 3 is less than 1 (4/16).
  3. The distribution of \(X+Y\) is not the same as the distribution of \(2X\); \(X+Y\) is 3 with probability 1, but \(2X\) takes values 0, 2, 4, 6 with nonzero probability.
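These counterexamples can be checked by enumerating all 16 equally likely outcomes in plain Python (illustrative code, not the book’s):

```python
from itertools import product

# All 16 equally likely outcomes of 4 flips of a fair coin
outcomes = list(product('HT', repeat=4))
X = [w[:3].count('H') for w in outcomes]  # heads in flips 1 through 3
Y = [w[:3].count('T') for w in outcomes]  # tails in flips 1 through 3
Z = [w[1:].count('H') for w in outcomes]  # heads in flips 2 through 4

# X + Y is always 3, but X + Z is 3 on only 4 of the 16 outcomes
assert all(x + y == 3 for x, y in zip(X, Y))
assert sum(x + z == 3 for x, z in zip(X, Z)) == 4
# (X, Z) = (3, 3) occurs on exactly 1 outcome (HHHH); (X, Y) = (3, 3) never occurs
assert sum((x, z) == (3, 3) for x, z in zip(X, Z)) == 1
assert all((x, y) != (3, 3) for x, y in zip(X, Y))
```

The assertions confirm the probabilities 1, 4/16, 1/16, and 0 cited in the solution.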

Remember that a joint distribution is a probability distribution on pairs of values. Just because \(X_1\) and \(X_2\) have the same marginal distribution, and \(Y_1\) and \(Y_2\) have the same marginal distribution, doesn’t necessarily imply that \((X_1, Y_1)\) and \((X_2, Y_2)\) have the same joint distribution. In general, information about the marginal distributions alone is not enough to determine the joint distribution. We saw a related two-way table example in Section 2.5. Just because two two-way tables have the same totals doesn’t mean they have the same interior cells.

The distribution of any random variable obtained via a transformation of multiple random variables will depend on the joint distribution of the random variables involved; for example, the distribution of \(X+Y\) depends on the joint distribution of \(X\) and \(Y\).

Example 3.39 Consider the probability space corresponding to two spins of the Uniform(0, 1) spinner and let \(U_1\) be the result of the first spin and \(U_2\) the result of the second. For each of the following pairs of random variables, determine whether or not they have the same distribution as each other. No calculations necessary; just think conceptually.

  1. \(U_1\) and \(U_2\)
  2. \(U_1\) and \(1-U_1\)
  3. \(U_1\) and \(1+U_1\)
  4. \(U_1\) and \(U_1^2\)
  5. \(U_1+U_2\) and \(2U_1\)
  6. \(U_1\) and \(1-U_2\)
  7. Is the joint distribution of \((U_1, 1-U_1)\) the same as the joint distribution of \((U_1, 1 - U_2)\)?

Solution 3.39.

  1. Yes, each has a Uniform(0, 1) distribution.
  2. Yes, each has a Uniform(0, 1) distribution. For \(u\in[0, 1]\), \(1-u\in[0, 1]\), so \(U_1\) and \(1-U_1\) have the same possible values, and a linear rescaling does not change the shape of the distribution. Changing from \(U_1\) to \(1-U_1\) essentially amounts to switching the [0, 1] labels on the spinner from clockwise to counterclockwise.
  3. No, the two variables do not have the same possible values. The shapes would be similar though; \(1+U_1\) has a Uniform(1, 2) distribution.
  4. No, a non-linear rescaling generally changes the shape of the distribution. For example, \(\textrm{P}(U_1\le0.49) = 0.49\), but \(\textrm{P}(U_1^2 \le 0.49) = \textrm{P}(U_1 \le 0.7) = 0.7\). Squaring a number in [0, 1] makes the number even smaller, so the distribution of \(U_1^2\) places higher density on smaller values than \(U_1\) does.
  5. No, \(U_1+U_2\) has a triangular shaped distribution on (0, 2) with a peak at 1. (The shape is similar to that of the distribution of \(X\) in ?sec-sim-transform-joint, but the possible values are (0, 2) rather than (2, 8).) But \(2U_1\) has a Uniform(0, 2) distribution. Do not confuse a random variable with its distribution. Just because \(U_1\) and \(U_2\) have the same distribution, you cannot replace \(U_2\) with \(U_1\) in transformations. The random variable \(U_1+U_2\) is not the same random variable as \(2U_1\); spinning a spinner and adding the spins will not necessarily produce the same value as spinning a spinner once and multiplying the result by 2.
  6. Yes. Since \(U_2\) has a Uniform(0, 1) distribution, \(1-U_2\) does too, by the same reasoning as in part 2.
  7. No. The marginal distributions are the same, but the joint distribution of \((U_1, 1-U_1)\) places all density along a line, while the joint density of \((U_1, 1-U_2)\) is distributed over the whole two-dimensional region \([0, 1]\times[0,1]\).
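Part 5 can be checked with a quick Monte Carlo sketch in plain Python (illustrative code with our own variable names): \(U_1+U_2\) concentrates near its peak at 1, while \(2U_1\) is uniform on (0, 2).

```python
import random

rng = random.Random(0)
reps = 100000
sum_near_1 = double_near_1 = 0
for _ in range(reps):
    u1, u2 = rng.random(), rng.random()
    sum_near_1 += 0.9 < u1 + u2 < 1.1     # triangular distribution, peak at 1
    double_near_1 += 0.9 < 2 * u1 < 1.1   # Uniform(0, 2) distribution

# Exact values: P(0.9 < U1 + U2 < 1.1) = 0.19, P(0.9 < 2*U1 < 1.1) = 0.10
print(sum_near_1 / reps, double_near_1 / reps)
```

The two relative frequencies are clearly different (about 0.19 versus 0.10), so the distributions differ even though \(U_1\) and \(U_2\) have the same marginal distribution.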

Do not confuse a random variable with its distribution. This is probably getting repetitive by now, but we’re emphasizing this point for a reason. Many common mistakes in probability problems involve confusing a random variable with its distribution. For example, we will soon see that if a continuous random variable \(X\) has probability density function \(f(x)\), then the probability density function of \(X^2\) is NOT \(f(x^2)\) nor \((f(x))^2\). Mistakes like these, which are very common, essentially involve confusing a random variable with its distribution. Understanding the fundamental difference between a random variable and its distribution will help you avoid many common mistakes, especially in problems involving a lot of calculus or mathematical symbols.

3.16 Chapter exercises


  1. In most situations we’ll encounter in this book, the “set up” step requires the most work and the most computer programming, while the “simulate” and “summarize” steps are usually straightforward. In many other cases, these steps can be challenging. Complex set ups often require sophisticated methods, such as MCMC (Markov Chain Monte Carlo) algorithms, to efficiently simulate realizations. Effectively summarizing high dimensional simulation output often requires the use of multivariate statistics and visualizations.↩︎

  2. Our use of “box models” is inspired by (Freedman, Pisani, and Purves 2007).↩︎

  3. “With replacement” always implies replacement at a uniformly random point in the box. Think of “with replacement” as “with replacement and reshuffling” before the next draw.↩︎

  4. If we’re repeating something many times, do we perform “a simulation” or “many simulations”? Throughout, “a simulation” refers to the collection of results corresponding to repeatedly artificially recreating the random process. “A repetition” refers to a single artificial recreation resulting in a single simulated outcome. We perform a simulation which consists of many repetitions.↩︎

  5. Technically, a Normal(30, 10) distribution allows for values outside the [0, 60] interval, but the probability is small, roughly 0.003.↩︎

  6. We primarily view a Symbulate probability space as a description of the probability model rather than an explicit specification of the sample space \(\Omega\). For example, we define a BoxModel instead of creating a set with all possible outcomes. We tend to represent a probability space with P, even though this is a slight abuse of notation (since \(\textrm{P}\) typically refers to a probability measure).↩︎

  7. The default argument for replace is True, so we could have just written BoxModel([1, 2, 3, 4], size = 2).↩︎

  8. There is an additional argument order_matters which defaults to True, but we could set it to False for unordered pairs.↩︎

  9. We generally use names in our code that mirror and reinforce standard probability notation, e.g., uppercase letters near the end of the alphabet for random variables, with corresponding lowercase letters for their realized values. Of course, these naming conventions are not necessary and you are welcome to use more descriptive names in your code. For example, we could have named the probability space DiceRolls and the random variables DiceSum and DiceMax rather than P, X, Y. Whatever you do, we encourage you to use names that distinguish the objects themselves from their simulated values, e.g. Dice_Sum for the random variable and dice_sum_sim for the simulated values.↩︎

  10. Technically & joins two RVs together to form a random vector. While we often interpret Symbulate RV as “random variable”, it really functions as “random vector”.↩︎

  11. You might try (P & X).sim(10). But P is a probability space object, and X is an RV object, and & can only be used to join like objects together. Much like in probability theory in general, in Symbulate the probability space plays a background role, and it is usually random variables we are interested in.↩︎

  12. The components of an RV can also be accessed using brackets. U1, U2 = RV(P) is shorthand for
    U = RV(P); U1 = U[0]; U2 = U[1].↩︎

  13. We could have actually defined Z = (X - 5) ** 2 here. We’ll see examples where the apply syntax is necessary later.↩︎

  14. We can also define U = RV(P) and then X = U.apply(sum).↩︎

  15. We can also define U = RV(P) and then X = U.apply(max).↩︎

  16. Braces {} are used here because this defines a Python dictionary. But don’t confuse this code with set notation.↩︎

  17. Careful: don’t confuse Uniform(a, b) with DiscreteUniform(a, b). Uniform(a, b) corresponds to the continuous interval \([a, b]\).↩︎

  18. We would typically only include the values observed in the simulation in the summary. However, we include 4 and 8 here because if we had performed more repetitions we would likely have observed these values.↩︎

  19. “Normalize” is used in the sense of Section 1.3 and refers to rescaling the values so that they add up to 1 but the ratios are preserved.↩︎

  20. For discrete random variables 'impulse' is the default plot type. Like .tabulate(), the .plot() method also has a normalize argument; the default is normalize=True.↩︎

  21. In 10000 flips, the probability of heads on between 49% and 51% of flips is 0.956, so 98 out of 100 provides a rough estimate of this probability. We will see how to compute such a probability later.↩︎
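
The 0.956 can be checked with a Normal approximation to the Binomial (a sketch only; computing such probabilities is discussed later in the book):

```python
import math

def normal_cdf(z):
    # Standard Normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

n, p = 10000, 0.5
mean = n * p                      # 5000 heads expected
sd = math.sqrt(n * p * (1 - p))   # standard deviation 50

# P(4900 <= number of heads <= 5100), with a continuity correction.
prob = normal_cdf((5100 + 0.5 - mean) / sd) - normal_cdf((4900 - 0.5 - mean) / sd)
print(round(prob, 3))  # 0.956
```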

  22. This is difficult to see in Figure 3.12 due to the binning. But we can at least tell from Figure 3.12 that at most a handful of the 100 sets resulted in a proportion of heads exactly equal to 0.5.↩︎

  23. Technically, we should say “a margin of error for 95% confidence of 0.01”. We’ll discuss “confidence” in a little more detail soon.↩︎

  24. In 1,000,000 flips, the probability of heads on between 49.9% and 50.1% of flips is 0.955, and 91 out of 100 sets provides a rough estimate of this probability.↩︎

  25. In 100 million flips, the probability of heads on between 49.99% and 50.01% of flips is 0.977, and 96 out of 100 sets provides a rough estimate of this probability.↩︎

  26. One difference between RV and apply: apply preserves the type of the input object. That is, if apply is applied to a ProbabilitySpace then the output will be a ProbabilitySpace; if apply is applied to an RV then the output will be an RV. In contrast, RV always creates an RV.↩︎

  27. Unfortunately, for technical reasons, RV(P, count_eq(6) / N) will not work. It is possible to divide by N within RV if we define a custom function def rel_freq_six(x): return x.count_eq(6) / N and then define RV(P, rel_freq_six).↩︎

  28. We will see the rationale behind this formula later. The factor 2 comes from the fact that for a Normal distribution, about 95% of values are within 2 standard deviations of the mean. Technically, the factor 2 corresponds to 95% confidence only when a single probability is estimated. If multiple probabilities are estimated simultaneously, then alternative methods should be used, e.g., increasing the factor 2 using a Bonferroni correction. For example, a multiple of 4 rather than 2 produces very conservative error bounds for 95% confidence even when many probabilities are being estimated.↩︎
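
As a rough sketch of how such a margin of error can be computed, assuming the standard form 2·sqrt(p(1−p)/N) that the factor-of-2 discussion suggests:

```python
import math

def margin_of_error(p_hat, n, factor=2):
    # Assumed form 2 * sqrt(p(1 - p) / n): factor 2 corresponds to
    # roughly 95% confidence for a single estimated probability; a
    # larger factor (e.g., 4) gives very conservative bounds when
    # many probabilities are estimated simultaneously.
    return factor * math.sqrt(p_hat * (1 - p_hat) / n)

# Worst case p_hat = 0.5 with N = 10000 repetitions: 2 * 0.005 = 0.01.
moe = margin_of_error(0.5, 10000)
moe_conservative = margin_of_error(0.5, 10000, factor=4)
```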

  29. In all the situations in this book the repetitions will be simulated independently. However, there are many simulation methods where this is not true, most notably MCMC (Markov Chain Monte Carlo) methods. The margin of error needs to be adjusted to reflect any dependence between simulated values.↩︎

  30. See the footnotes in Section 1.2.1.↩︎

  31. It can be shown that the standard deviation is always at least as big as the average absolute deviation. For a Normal distribution, the average absolute deviation is about 0.8 times the standard deviation.↩︎
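
A quick plain-Python check of both claims (illustrative only, using simulated Normal data):

```python
import random
import statistics

random.seed(42)
values = [random.gauss(0, 1) for _ in range(10000)]

mean = statistics.fmean(values)
sd = statistics.pstdev(values)
mad = statistics.fmean(abs(v - mean) for v in values)

# sd is always at least mad; for Normal data the ratio mad / sd is
# close to sqrt(2 / pi), about 0.8.
ratio = mad / sd
```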

  32. If you have some familiarity with statistics, you might have seen a formula for variance or standard deviation that includes dividing by one less than the number of values (\(n-1\)). Dividing by \(n\) or \(n-1\) could make a difference in a small sample of data. However, we will always be interested in long run averages, and it typically won’t make any practical difference whether we divide by say 10000 or 9999. We’ll always compute averages by dividing by the number of values that we’re summing.↩︎
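
A quick check with Python's statistics module (pstdev divides by \(n\), stdev by \(n-1\)) illustrates how little the choice matters for large samples:

```python
import statistics

data = [2.1, 3.5, 4.0, 1.8, 2.9] * 2000   # 10000 values

sd_n = statistics.pstdev(data)   # divides by n
sd_n1 = statistics.stdev(data)   # divides by n - 1

# sd_n1 exceeds sd_n by a factor of sqrt(n / (n - 1)), a negligible
# difference when n = 10000.
difference = sd_n1 - sd_n
```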

  33. Think Pythagorean theorem: it’s \(a^2+b^2=c^2\). On the other hand, for absolute values we only have the triangle inequality \(|a+b| \le |a| + |b|\).↩︎
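
A small simulation sketch of this Pythagorean behavior, assuming (as is standard) that the analogy refers to variances of independent random variables adding:

```python
import random
import statistics

random.seed(0)
n = 100_000

# Two independent samples with standard deviations a = 3 and b = 4.
x = [random.gauss(0, 3) for _ in range(n)]
y = [random.gauss(0, 4) for _ in range(n)]

# The standard deviation of the sum is close to sqrt(3**2 + 4**2) = 5,
# not 3 + 4 = 7: it's the squares that add, as in a^2 + b^2 = c^2.
sd_sum = statistics.pstdev(xi + yi for xi, yi in zip(x, y))
```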

  34. All of our simulations are based on independently simulated repetitions. More sophisticated methods, such as Markov chain Monte Carlo (MCMC) methods, allow dependent repetitions and can approximate conditional probabilities much more efficiently.↩︎

  35. Age_Snapchat is a pair so Age_Snapchat[0] is the first component (Age) and Age_Snapchat[1] is the second component (Snapchat user).↩︎

  36. In Symbulate, events being conditioned on need to be based on RVs. But the code shows that Symbulate has a pretty broad interpretation of what can be an RV.↩︎

  37. .draw() is like .sim(1) but they have different properties. We will usually work with .sim() to actually run a simulation. .draw() is useful in situations like this where we have to build a custom probability space from scratch.↩︎

  38. Python automatically treats True as 1 and False as 0, so the code (y < 3) effectively returns both True/False realizations of the event itself and 1/0 realizations of the corresponding indicator random variable.↩︎
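
A minimal plain-Python illustration of this dual True/False and 1/0 behavior (with hypothetical simulated values):

```python
y_sim = [1, 4, 2, 5, 1, 3, 2, 6, 1, 2]   # hypothetical simulated values of Y

# Element-wise comparison gives True/False occurrences of the event {Y < 3}.
indicators = [y < 3 for y in y_sim]

# Because True counts as 1 and False as 0, sum() counts occurrences of the
# event and the mean is its relative frequency.
count = sum(indicators)               # 6
rel_freq = count / len(indicators)    # 0.6
```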

  39. We wrote this code to compare to the previous block that returned an error. Even though the revised code works, it’s clunky because we unpack the rolls, rejoin them in random variable world, and then unpack them again in simulation world. We could simplify a little by not unpacking and rejoining in random variable world, using U = RV(P); u1_and_u2 = U.sim(10). This still isn’t the preferred way to sum the rolls, but we’re just illustrating a few ways to work in both random variable world and simulation world.↩︎

  40. Because of the way we defined count_matches we need to redefine it if we change n and labels, which is why count_matches is defined inside matching_sim.↩︎
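
A hypothetical plain-Python analogue (the book's Symbulate version differs) showing the helper defined inside the simulation function so it always refers to the current n and labels:

```python
import random

def matching_sim(n, reps=10000, seed=12345):
    # Plain-Python sketch of the matching problem: shuffle n labeled
    # objects and count how many land in their original spots, for
    # each of reps repetitions.
    labels = list(range(n))

    def count_matches(perm):
        # Defined inside matching_sim so it always uses the current
        # labels, as discussed in the footnote above.
        return sum(label == spot for label, spot in zip(perm, labels))

    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        perm = labels.copy()
        rng.shuffle(perm)
        results.append(count_matches(perm))
    return results

matches = matching_sim(5)
avg_matches = sum(matches) / len(matches)   # expected number of matches is 1
p_at_least_one = sum(m >= 1 for m in matches) / len(matches)
```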

  41. The simulation margin of error for relative frequency of at least one match is about 0.01 based on 10000 simulated values. We haven’t discussed simulation margins of error when approximating expected values yet. In this case, the margin of error for a single average based on 10000 simulated values is about 0.02.↩︎

  42. We might also be interested in other sensitivity analyses, such as what if the objects are not placed uniformly at random in the spots.↩︎

  43. A distribution can also be interpreted as a subjective assessment of relative likelihood or plausibility, but here we’ll focus on the long run relative frequency interpretation.↩︎

  44. Which isn’t quite true.↩︎

  45. There is no value judgment; sSuccess” just refers to whatever we’re counting. Did the thing we’re counting happen on this trial (“success) or not (”failure”). Success isn’t necessarily good.↩︎